<<<<<<< HEAD ======= >>>>>>> 46d70cd7a317d2fd44af3c48309e03675fb52cee

Chapter 1: Intro

ex 1.1. List five differences between raster and vector data.

  • raster data have values for pixels, vector data for points, lines of polygons
  • spatial locations of raster pixels are constrained to a grid, vector data coordinates can have arbitrary locations (only limited by floating point representation of coordinates)
  • raster data lend themselves well to represent spatially continuously observed variables (such as imagery) or spatially continuously varying variables (such as elevation or temperature); vector data lend themselves well to represent spatially discrete features such as houses and roads, or administrative regions.
  • raster data cover their spatial extent completely: every point is part of a single pixel; vector data may contain holes, or have intersecting geometries where points belong to multiple polygons.
  • the operations on raster data are often simple mathematical (raster algebra) operations that include spatial operations; such simple operations are usually not available for vector data.
  • raster data has trivial topology: it is clear which 4 or 8 pixels are the neighbours of every pixel; for vector data spatial there are more types of relationships, and these relationships are more complicated to detect.

The answer “Raster data is continuous data while vector data is discrete data.” is not complete: a raster of land use type represens a discrete (type) variable, a polygon map with population density represents a continuous variable. The difference lies in spatially continuous variables like elevation or temperature which are more easily represented by raster data, and spatially discrete features, such as houses and roads, which are easier represented by vector data.

ex 1.2. In addition to those listed below figure 1.1, list five further graphical components that are often found on a map.

  • scale bar
  • data source
  • well defined title, subtitle
  • orientation indicator, north arrow
  • further reference elements: seas, land mass, rivers

ex 1.3. Why the numeric information shown in figure 1.4 misleading (or meaningless):

The values shown in figure 1.4 are population total associated with their respective counties. Without the county boundaries the meaning disappears: raster pixels do not contain population totals per pixel, population totals over larger regions or populations densities can no longer be derived based on this raster map alone.

ex 1.4. Under which conditions would you expect strong differences when doing geometrical operations on \(S^2\), compared to doing them on \(R^2\)

  • when computing distances between two points at large distance from each other
  • when determining what the shortest line is between two points, in particular near to the poles, or when the antimeridian crosses this line

Chapter 2: Coordinates

ex 2.1. list three geographic measures that do not have a natural zero origin

  • longitude: the zero meridian is arbitrary, 100 years ago there were many other zero meridians fashionable
  • latitude: the equator may feel like a natural zero, but one could equally use the North Pole as zero, or choose entirely different origins and orientation for longitude and latitude.
  • altitude (measured w.r.t. mean sea level, geoid, or ellispoid)

ex 2.2 - 2.4

(thanks to Jonas Hurst)

Convert the (x, y) point s (10, 2), (-10, -1), (10, -2) and (0, 10) to polar cooridnates

cart2polar = function(x, y){
  r = sqrt(x*x + y*y)  # compute r (distance from origin)
  phi = atan2(y, x)  # compute phi (angle between point and positive x axis in rad)
  phi_deg = phi * 180 / pi  #  compute angle in deg
  result = c(r, phi_deg)
  return(result)
}

cart2polar(10, 2)
## [1] 10.19804 11.30993
cart2polar(-10, -1)
## [1]   10.04988 -174.28941
cart2polar(10, -2)
## [1]  10.19804 -11.30993
cart2polar(0, 10)
## [1] 10 90

Convert from Polar to Cartesian

Convert the polar (r, phi) points (10, 45°), (0, 100°) and (5, 259°) to Cartesian coordinates

deg2rad = function(angle_degree) {
  angle_degree * pi / 180
}

polar2cart = function(r, phi_deg){
  # phi must be in degrees
  phi_rad = deg2rad(phi_deg)  # convert phi in degrees to radians
  x = r * cos(phi_rad)
  y = r * sin(phi_rad)
  c(x, y) # return value
}

polar2cart(10, 45)
## [1] 7.071068 7.071068
polar2cart(0, 100)
## [1] 0 0
polar2cart(5, 259)
## [1] -0.954045 -4.908136

assuming the Earth is a sphere with a radius of 6371 km, compute for (lambda, phi) points the great circle distance between (10, 10) and (11, 10), between (10, 80) >and (11, 80), between (10, 10) and (10, 11) and between (10, 80) and (10, 81).

distOnSphere = function(l1, phi1, l2, phi2, radius) {
  l1_rad = deg2rad(l1)
  l2_rad = deg2rad(l2)
  phi1_rad = deg2rad(phi1)
  phi2_rad = deg2rad(phi2)

  theta = acos(
    sin(phi1_rad) * sin(phi2_rad) +
    cos(phi1_rad) * cos(phi2_rad) * cos(abs(l1_rad - l2_rad))
  )
  radius * theta # return value
}

radius = 3671
distOnSphere(10, 10, 11, 10, radius)
## [1] 63.09763
distOnSphere(10, 80, 11, 80, radius)
## [1] 11.12568
distOnSphere(10, 10, 10, 11, radius)
## [1] 64.07104
distOnSphere(10, 80, 10, 81, radius)
## [1] 64.07104

Unit of all results are kilometers.

Chapter 3: Geometries

(thanks to Jannis Fröhlking)

ex 3.1 Give two examples of geometries in 2-D (flat) space that are not simple feature geometries, and create a plot of them.

library(sf)
<<<<<<< HEAD
## Linking to GEOS 3.9.1, GDAL 3.3.2, PROJ 7.2.1
=======
## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
>>>>>>> 46d70cd7a317d2fd44af3c48309e03675fb52cee
x1 <- st_linestring(rbind(c(0,0),c(2,2),c(0,2),c(2,0)))
x2 <- st_polygon(list(rbind(c(3,0),c(5,2),c(3,2),c(5,0),c(3,0))))
plot(c(x1,x2), col = 2:3)

st_is_simple(x1)
## [1] FALSE
st_is_simple(x2)
## [1] FALSE

ex 3.2 Recompute the coordinates 10.542, 0.01, 45321.789 using precision values 1, 1e3, 1e6, and 1e-2.

for(i in c(1,1e3,1e6,1e-2)) 
  print(round(i * c(10.542, 0.01, 45321.789))/i)
## [1]    11     0 45322
## [1]    10.542     0.010 45321.789
## [1]    10.542     0.010 45321.789
## [1]     0     0 45300

ex 3.3 Describe a practical problem for which an n-ary intersection would be needed.

  • for a long-term set of polygons with fire extents, find the polygons that underwent 0, 1, 2, 3, … fires
  • for a set of extents of n individual plant species, find polygons with 0, 1, …, n species, or find the polygon(s) that contain a particular subset of plant species.

ex 3.4 How can you create a Voronoi diagram (figure 3.3) that has closed polygons for every point?

Voronoi diagrams have “open polygons”, areas that extend into infinity, for boundary points. These cannot be represented by simple feature geometries. st_voronoi chooses a default (square) polygon to limit the extent, which can be enlarged. Alternatively, the extent can be limited using st_intersection on its result:

library(sf)
par(mfrow = c(2,2))
set.seed(133331)
mp = st_multipoint(matrix(runif(20), 10))
plot(st_voronoi(mp), col = NA, border = 'black')
plot(mp, add = TRUE)
title("default extent")
e2 = st_polygon(list(rbind(c(-5,-5), c(5, -5), c(5,5), c(-5, 5), c(-5,-5))))
plot(st_voronoi(mp, envelope = e2), col = NA, border = 'black')
plot(mp, add = TRUE)
title("enlarged envelope")
e3 = st_polygon(list(rbind(c(0,0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
v = st_voronoi(mp) %>% st_collection_extract() # pulls POLYGONs out of GC
plot(st_intersection(v, e3), col = NA, border = 'black', axes=TRUE)
plot(mp, add = TRUE)
title("smaller, intersected envelope")
<<<<<<< HEAD

=======

>>>>>>> 46d70cd7a317d2fd44af3c48309e03675fb52cee

ex 3.5 Give the unary measure dimension for geometries POINT Z (0 1 1), LINESTRING Z (0 0 1,1 1 2), and POLYGON Z ((0 0 0,1 0 0,1 1 0,0 0 0))

st_dimension(st_point(c(0,1,1)))
## [1] 0
st_dimension(st_linestring(rbind(c(0,1,1),c(1,1,2))))
## [1] 1
st_dimension(st_polygon(list(rbind(c(0,0,0),c(1,0,0),c(1,1,0),c(0,0,0)))))
## [1] 2

(these are all zero-dimensional geometries because they are points, irrespective the number of dimensions they’re defined in)

ex 3.6 Give the DE-9IM relation between LINESTRING(0 0,1 0) and LINESTRING(0.5 0,0.5 1); explain the individual characters.

line_1 = st_linestring(rbind(c(0,0),c(1,0)))
line_2 = st_linestring(rbind(c(.5,0),c(.5,1)))
plot(line_1,col = "green")
plot(line_2,col = "red", add = TRUE)

st_relate(line_1, line_2)
##      [,1]       
## [1,] "F01FF0102"

The DE-9IM relation is F01FF0102

  • F Intersection of green lines interior and red lines interior is empty
  • 0 Intersection of green lines interior and red lines boundary results in one point in the middle of the green line
  • 1 Intersection of green lines interior and red lines exterior results in a line covering most parts of the green line
  • F Intersection of green lines boundary and red lines interior is empty
  • F Intersection of green lines boundary and red lines boundary is empty
  • 0 Intersection of green lines boundary and red lines exterior results in the two boundary points of the green line
  • 1 Intersection of green lines exterior and red lines interior results in a line covering most parts of the red line
  • 0 Intersection of green lines exterior and red lines boundary results in the upper boundary point of the red line
  • 2 Intersection of green lines exterior and red lines results in a polygonal geometry covering everything except the two lines

(the boundary of a LINESTRING is formed by its two end points)

ex 3.7 Can a set of simple feature polygons form a coverage? If so, under which constraints?

Yes, but I would say that the set may just contain one polygon, because simple features provide no way of assigning points on the boundary of two adjacent polygons to a single polygon.

ex 3.8 For the nc counties in the dataset that comes with R package sf, find the points touched by four counties.

# read data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
## Reading layer `nc' from data source `/home/rsb/lib/r_libs/sf/shape/nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS:  NAD27
# get intersections
(nc_geom = st_geometry(nc))
## Geometry set for 100 features 
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS:  NAD27
## First 5 geometries:
## MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...
## MULTIPOLYGON (((-81.23989 36.36536, -81.24069 3...
## MULTIPOLYGON (((-80.45634 36.24256, -80.47639 3...
## MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36...
## MULTIPOLYGON (((-77.21767 36.24098, -77.23461 3...
nc_ints = st_intersection(nc_geom)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
plot(nc_ints, main = "All intersections")
<<<<<<< HEAD

=======

>>>>>>> 46d70cd7a317d2fd44af3c48309e03675fb52cee
# Function to check class of intersection objects
get_points = function(x){
  if(class(x)[2]=="POINT")  return(x)
}
# get points
points = lapply(nc_ints, get_points)
points[sapply(points,is.null)] <- NULL
sf_points = st_sfc(points)
st_crs(sf_points) = st_crs(nc)
# get points with four neighbouring geometries (=states)
touch = st_touches(sf_points, nc_geom)
four_n = sapply(touch, function(y) which(length(y)==4))
names(four_n) = seq_along(four_n)
point_no = array(as.numeric(names(unlist(four_n))))
result = st_sfc(points[point_no])
plot(nc_geom, main = "Points touched by four counties")
plot(result, add = TRUE, col = "red", pch = 10, cex = 2)

A more compact way might be to search for points where counties touch another county only in a point, which can be found using st_relate using a pattern:

(pts = nc %>% st_relate(pattern = "****0****"))
## although coordinates are longitude/latitude, st_relate_pattern assumes that they are planar
## Sparse geometry binary predicate list of length 100, where the
## predicate was `relate_pattern'
## first 10 elements:
##  1: (empty)
##  2: (empty)
##  3: (empty)
##  4: (empty)
##  5: (empty)
##  6: (empty)
##  7: (empty)
##  8: (empty)
##  9: 31
##  10: 26
nc %>% st_relate(pattern = "****0****") %>% lengths() %>% sum()
## although coordinates are longitude/latitude, st_relate_pattern assumes that they are planar
## [1] 28

which is, as expected, four times the number of points shown in the plot above.

How can we find these points? See here:

nc = st_geometry(nc)
s2 = sf_use_s2(FALSE) # use GEOM geometry
## Spherical geometry (s2) switched off
pts = st_intersection(nc, nc)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
pts = pts[st_dimension(pts) == 0]
plot(st_geometry(nc))
plot(st_geometry(pts), add = TRUE, col = "red", pch = 10, cex = 2)

sf_use_s2(s2) # set back
## Spherical geometry (s2) switched on

ex 3.9 How would figure 3.6 look like if delta for the y-coordinate was positive?

Only cells that were fully crossed by the red line would be grey:

library(stars)
## Loading required package: abind
ls = st_sf(a = 2, st_sfc(st_linestring(rbind(c(0.1, 0), c(1, .9)))))
grd = st_as_stars(st_bbox(ls), nx = 10, ny = 10, xlim = c(0, 1.0), ylim = c(0, 1),
   values = -1)
attr(grd, "dimensions")$y$delta = .1
attr(grd, "dimensions")$y$offset = 0 
r = st_rasterize(ls, grd, options = "ALL_TOUCHED=TRUE")
r[r == -1] = NA
plot(st_geometry(st_as_sf(grd)), border = 'orange', col = NA, 
     reset = FALSE, key.pos=NULL)
plot(r, axes = TRUE, add = TRUE, breaks = "equal") # ALL_TOUCHED=FALSE;
plot(ls, add = TRUE, col = "red", lwd = 2)

The reason is that in this case, lower left corners of grid cells are part of the cell, rather than upper left corners.

Chapter 4: Spherical geometry

ex 4.1 Straight GeoJSON lines

How does the GeoJSON format define “straight” lines between ellipsoidal coordinates (section 3.1.1)? Using this definition of straight, how would LINESTRING(0 85,180 85) look like in a polar projection? How could this geometry be modified to have it cross the North Pole?

GeoJSON defines straight lines between pairs of ellipsoidal coordinates as the straight line in Cartesian space formed by longitude and latitude. This means e.g. that all parallels are straight lines.

Using this definition of straight, how would LINESTRING(0 85,180 85) look like in a polar projection?

Like a half circle:

library(sf)
l <- st_as_sfc("LINESTRING(0 85,180 85)") %>%
    st_segmentize(1) %>%
    st_set_crs('EPSG:4326')
plot(st_transform(l, 'EPSG:3995'), col = 'red', lwd = 2,
     graticule = TRUE, axes = TRUE, reset = FALSE)

How could this geometry be modified to have it cross the North Pole?

One would have to let it pass through (0 90) and (180, 90):

library(sf)
l <- st_as_sfc("LINESTRING(0 85,0 90,180 90,180 85)") %>%
    st_segmentize(1) %>%
    st_set_crs('EPSG:4326')
plot(st_transform(l, 'EPSG:3995'), col = 'red', lwd = 2,
     graticule = TRUE, axes = TRUE, reset = FALSE)

ex 4.2 For a typical polygon on \(S^2\), how can you find out ring direction?

Ring direction (clock-wise CW, counter clock-wise CCW) is unambiguous on \(R^2\) but not on \(S^2\): on \(S^2\) every polygon divides the sphere’s surface in two parts. When the inside of the polygon is taken as the area to the left when traversing the polygons’s points then for a small polygon, then ring direction is CCW if the area of the polygon is smaller than half of the area of the sphere. For polygons dividing the sphere in two equal parts (great circles such as the equator or meridians) ring direction is ambiguous.

ex 4.3 Are there advantages of using bounding caps over using bounding boxes? If so, list them.

Bounding caps may be more compact (have a smaller area compared to the bounding box corresponding to the same geometries), they need fewer parameters, and they are invariant under rotation of (the origins of) longitude and latitude.

For areas covering one of the poles, a bounding box will always need to have a longitude range that spans from -180 to 180, irrespective whether the geometry is centered around the pole.

ex 4.4 Why is, for small areas, the orthographic projection centered at the area a good approximation of the geometry as handled on \(S^2\)

Because that is the closest approximation of the geometry on \(R^2\).

ex 4.5 Fiji

For rnaturalearth::ne_countries(country = "Fiji", returnclass="sf"), check whether the geometry is valid on \(R^2\), on an orthographic projection centered on the country, and on \(S^2\). How can the geometry be made valid on S^2? Plot the resulting geometry back on \(R^2\). Compare the centroid of the country, as computed on \(R^2\) and on \(S^2\), and the distance between the two.

Valid on \(R^2\):

fi = rnaturalearth::ne_countries(country = "Fiji", returnclass="sf") %>%
        st_geometry()
s2 = sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
st_is_valid(fi)
## [1] TRUE

Valid on orthographic projection:

ortho = "+proj=ortho +lon_0=178.6 +lat_0=-17.3"
st_transform(fi, ortho) %>% st_is_valid()
## [1] FALSE
plot(st_transform(fi, ortho), border = 'red')

The red line following the antimeridian makes the geometry invalid in this projection, and also on \(S^2\):

sf_use_s2(TRUE)
## Spherical geometry (s2) switched on
st_is_valid(fi)
## [1] FALSE

Make valid on \(S^2\), and plot:

fi.s2 = st_make_valid(fi)
st_is_valid(fi.s2)
## [1] TRUE
plot(st_transform(fi.s2, ortho), border = 'red')
title("valid")

where we see that the line at the antimeridian has disappeared. This makes plotting in \(R^2\) look terrible, with lines spanning the globe:

plot(fi.s2, axes = TRUE)

Compare the centroid of the country, as computed on \(R^2\) and on \(S^2\), and the distance between the two.

sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
(c1 = st_centroid(fi))
## Warning in st_centroid.sfc(fi): st_centroid does not give correct centroids for
## longitude/latitude data
## Geometry set for 1 feature 
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 163.8532 ymin: -17.31631 xmax: 163.8532 ymax: -17.31631
## CRS:           +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## POINT (163.8532 -17.31631)
sf_use_s2(TRUE)
## Spherical geometry (s2) switched on
(c2 = st_centroid(fi.s2))
## Geometry set for 1 feature 
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 178.5684 ymin: -17.31562 xmax: 178.5684 ymax: -17.31562
## CRS:           +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## POINT (178.5684 -17.31562)
st_distance(c1, c2)
## Units: [m]
##         [,1]
## [1,] 1561723
sf_use_s2(s2)

Chapter 5: Attributes

ex 5.1. type of State

The appropriate value would be constant: there is no identity relationship of State to one of the counties in nc, and the value of State is constant through each county in the state (every point in every county in the state has this value for State).

ex 5.2. type of State for the entire state

Now, the unioned geometry is that of the state, and we can assign identity: there is only one state of North Carolina, an this geometry is its geometry.

ex 5.3. the AREA variable

The nc dataset is rather old, and did not come with an extensive report how, in detail, certain variables such as AREA were derived, so some detective work is needed here. How did people do this, more than three decades ago?

We can now compute area by

library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf"))
nc$AREA[1:10]
##  [1] 0.114 0.061 0.143 0.070 0.153 0.097 0.062 0.091 0.118 0.124
s2 = sf_use_s2(FALSE) # use spherical geometry:
## Spherical geometry (s2) switched off
nc$area = a_sph = st_area(nc)
nc$area[1:10]
## Units: [m^2]
##  [1] 1137388604  611077263 1423489919  694546292 1520740530  967727952
##  [7]  615942210  903650119 1179347051 1232769242
sf_use_s2(TRUE) # use ellipsoidal geometry:
## Spherical geometry (s2) switched on
nc$area = a_ell = st_area(nc)
nc$area[1:10]
## Units: [m^2]
##  [1] 1137107793  610916077 1423145355  694378925 1520366979  967504822
##  [7]  615794941  903423919 1179065710 1232475139
sf_use_s2(s2) # set back to original
cor(a_ell, a_sph)
## [1] 0.9999999

and this gives the area, in square metres, computed using either ellipsoidal or spherical geometry. We see that these are not identical, but nearly perfectly linearly correlated.

A first hypothesis might be a constant factor between the area and AREA variables. For this, we could try a power of 10:

nc$area2 = units::drop_units(nc$area / 1e10)
cor(nc$AREA, nc$area2)
## [1] 0.9998116
summary(lm(area2 ~ AREA, nc))
## 
## Call:
## lm(formula = area2 ~ AREA, data = nc)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.281e-03 -6.279e-04  6.328e-05  5.495e-04  2.746e-03 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0009781  0.0002692  -3.633 0.000448 ***
## AREA         1.0138124  0.0019882 509.920  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0009733 on 98 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 2.6e+05 on 1 and 98 DF,  p-value: < 2.2e-16
plot(area2 ~ AREA, nc)
abline(0, 1)

and we see a pretty good, close to 1:1 correspondence! But the factor 1e10 is strange: it does not convert square metres into a usual unit for area, neither for metric nor for imperial units.

Also, there are deviations from the 1:1 regression line. Could these be explained by the rounding of AREA to three digits? If rounding to three digits was the only cause of spread around the regression line, we would expect a residual standard error similar to the standard deviation of a uniform distribution with width .001, which is

sqrt(0.001^2/12)
## [1] 0.0002886751

but the one obtained int he regression is three times larger. Also, the units of AREA would be 1e10 \(m^2\), or 1e4 \(km^2\), which is odd and could ring some bells: one degree latitude corresponds roughly to 111 km, so one “square degree” at the equator corresponds roughly to \(1.11^2 \times 10^4\), and at 35 degrees North roughly to

111 ^ 2 * cos(35 / 180 * pi)
## [1] 10092.77

which closely corresponds to the regression slope found above.

We can compute “square degree” area by using the \(R^2\) area routines, e.g. obtained when we set the CRS to NA:

nc2 = nc
st_crs(nc2) = NA
nc2$area = st_area(nc2) # "square degrees"
plot(area ~ AREA, nc2)
abline(0,1)

cor(nc2$area, nc2$AREA)
## [1] 0.999983
summary(lm(area ~ AREA, nc2))
## 
## Call:
## lm(formula = area ~ AREA, data = nc2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.471e-04 -2.265e-04 -9.880e-06  2.714e-04  4.594e-04 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 7.436e-05  7.965e-05    0.934    0.353    
## AREA        9.996e-01  5.882e-04 1699.395   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0002879 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.888e+06 on 1 and 98 DF,  p-value: < 2.2e-16

We now get a much better fit, a near perfect correlation, and a regression standard error that corresponds exactly to what one would expect after rounding AREA to three digits.

A further “red flag” against the constant (1e10) conversion hypothesis is the spatial pattern of the regression residuals obtained by the first approach:

nc$resid = residuals(lm(area2 ~ AREA, nc))
plot(nc["resid"])

these residuals clearly show a North-South trend, corresponding to the effect that the Earth’s curvature has been ignored during the computation of AREA (ellipsoidal coordinates were treated as if they were Cartesian). “Square degrees” become smaller when going north.

The “unit” of the AREA variable is hence “square degree”. This is a meaningless unit for area on the sphere, because a unit square degree does not have a constant area.

ex 5.4 type of area

“area” is of type aggregate: it is a property of a polygon as a whole, not of each individual point in the polygon. It is extensive: if we cut a polygon in two parts, the total area is distributed over the parts.

ex 5.5 area-weighted interpolation

From the on-line version of the book we get the code that created the plot:

g = st_make_grid(st_bbox(st_as_sfc("LINESTRING(0 0,1 1)")), n = c(2,2))
par(mar = rep(0,4))
plot(g)
plot(g[1] * diag(c(3/4, 1)) + c(0.25, 0.125), add = TRUE, lty = 2)
text(c(.2, .8, .2, .8), c(.2, .2, .8, .8), c(1,2,4,8), col = 'red')

A question is how we can make g into an sf object with the right attribute values associated with the right geometries. We try values 1:4:

sf = st_sf(x = 1:4, geom = g)
plot(sf)

and see the order of the geometries: row-wise, bottom row first, so

sf = st_sf(x = c(1,2,4,8), geom = g)
plot(sf)

gives us the source object. We create target geometries by

dashed = g[1] * diag(c(3/4, 1)) + c(0.25, 0.125)
box = st_union(g)
c(dashed, box)
## Geometry set for 2 features 
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS:           NA
## POLYGON ((0.25 0.125, 0.625 0.125, 0.625 0.625,...
## POLYGON ((0 0.5, 0 1, 0.5 1, 1 1, 1 0.5, 1 0, 0...

and can call st_interpolate_aw to compute the area-weighted interpolations:

st_interpolate_aw(sf, c(dashed, box), extensive = TRUE)
## Warning in st_interpolate_aw.sf(sf, c(dashed, box), extensive = TRUE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS:           NA
##       x                       geometry
## 1  1.75 POLYGON ((0.25 0.125, 0.625...
## 2 15.00 POLYGON ((0 0.5, 0 1, 0.5 1...
st_interpolate_aw(sf, c(dashed, box), extensive = FALSE)
## Warning in st_interpolate_aw.sf(sf, c(dashed, box), extensive = FALSE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS:           NA
##          x                       geometry
## 1 2.333333 POLYGON ((0.25 0.125, 0.625...
## 2 3.750000 POLYGON ((0 0.5, 0 1, 0.5 1...

This generates a warning, which we can get rid of by setting the agr to constant:

st_agr(sf) = "constant"
st_interpolate_aw(sf, c(dashed, box), FALSE)
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS:           NA
##          x                       geometry
## 1 2.333333 POLYGON ((0.25 0.125, 0.625...
## 2 3.750000 POLYGON ((0 0.5, 0 1, 0.5 1...

Chapter 6: Data Cubes

ex 6.1.

Why is it difficult to represent trajectories, sequences of \((x,y,t)\) obtained by tracking moving objects, by data cubes as described in this chapter?

  • rounding \((x,y,t)\) to the discrete set of dimension values in a data cube may cause loss of information
  • if the dimensions all have a high resolution, data loss is limited but the data cube will be very sparse; this will only be effective if a system capable of storing sparse data cubes is used (e.g. SciDB, TileDB)

ex 6.2.

In a socio-economic vector data cube with variables population, life expectancy, and gross domestic product ordered by dimensions country and year, which variables have block support for the spatial dimension, and which have block support for the temporal dimension?

  • population has spatial block support (total over an area), typically not temporal block support (but the population e.g. on a particular day of the year)
  • life expectancy is calculated over the total population of the country, and as such has spatial block support; it has temporal block support as the number of deaths over a particular period are counted, it is not clear whether this always corresponds to a single year or a longer period.
  • GDP has both spatial and temporal block support: it is a total over an area and a time period.

ex 6.3.

The Sentinel-2 satellites collect images in 12 spectral bands; list advantages and disadvantages to represent them as (i) different data cubes, (ii) a data cube with 12 attributes, one for each band, and (iii) a single attribute data cube with a spectral dimension.

  • as (i): it would be easy to cope with the differences in cell sizes;
  • as (ii): one would have to cope with differences in cell sizes (10, 20, 60m), and it would not be easy to consider the spectral reflectance curve of individual pixels
  • as (iii): as (ii) but it would be easier to consider (analyse, classify, reduce) spectral reflectance curves, as they are now organized in a dimension

ex 6.4.

Explain why a curvilinear raster as shown in figure 1.5 can be considered a special case of a data cube.

  • Curvilinear grids do not have a simple relationship between dimension index (row/col, i/j) to coordinate values (lon/lat, x/y): one needs both row and col to find the coordinate pair, and from a coordinate pair a rather complex look-up to find the corresponding row and column.

ex 6.5.

Explain how the following problems can be solved with data cube operations filter, apply, reduce and/or aggregate, and in which order. Also mention for each which function is applied, and what the dimensionality of the resulting data cube is (if any):

ex 6.5.1

from hourly \(PM_{10}\) measurements for a set of air quality monitoring stations, compute per station the amount of days per year that the average daily \(PM_{10}\) value exceeds 50 \(\mu g/m^3\)

  • convert measured hourly values into daily averages: aggregate (from hourly to daily, function: mean)
  • convert daily averages into TRUE/FALSE whether the daily average exceeds 50: apply (function: larger-than)
  • compute the number of days: reduce time (function: sum)

This gives a one-dimensional data cube, with dimension “station”

ex 6.5.2

for a sequence of aerial images of an oil spill, find the time at which the oil spill had its largest extent, and the corresponding extent

  • for each image, classify pixels into oil/no oil: apply (function: classify)
  • for each image, compute size (extent) of oil spill: reduce space (function: sum)
  • for the extent time series, find time of maximum: reduce time (function: which.max, then look up time)

This gives a zero-dimensional data cube (a scalar).

ex 6.5.3

from a 10-year period with global daily sea surface temperature (SST) raster maps, find the area with the 10% largest and 10% smallest temporal trends in SST values.

  • from daily SST to trend values per pixel: reduce time (function: trend function, lm)
  • from trend raster, find 10- and 90-percentile: reduce space (function: quantile)
  • using percentiles, threshold the trend raster: apply (function: less than / more than)

This gives a two-dimensional data cube (or raster layer: the reclassified trend raster).

Chapter 7: sf, stars

ex. 7.1

Find the names of the nc counties that intersect LINESTRING(-84 35,-78 35); use [ for this, and use st_join() for this.

(file = system.file("gpkg/nc.gpkg", package="sf"))
## [1] "/home/rsb/lib/r_libs/sf/gpkg/nc.gpkg"
nc = st_read(file)
## Reading layer `nc.gpkg' from data source `/home/rsb/lib/r_libs/sf/gpkg/nc.gpkg' using driver `GPKG'
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS:  NAD27
line = st_as_sfc("LINESTRING(-84 35,-78 35)", crs = st_crs(nc))
nc[line,]$NAME
##  [1] "Jackson"     "Mecklenburg" "Macon"       "Sampson"     "Cherokee"   
##  [6] "Cumberland"  "Union"       "Anson"       "Hoke"        "Duplin"     
## [11] "Richmond"    "Clay"        "Scotland"
st_join(st_sf(line), nc)$NAME # left join: `line` should be first argument
##  [1] "Jackson"     "Mecklenburg" "Macon"       "Sampson"     "Cherokee"   
##  [6] "Cumberland"  "Union"       "Anson"       "Hoke"        "Duplin"     
## [11] "Richmond"    "Clay"        "Scotland"

ex. 7.2

Repeat this after setting sf_use_s2(FALSE), and compute the difference (hint: use setdiff()), and color the counties of the difference using color ‘#00880088’.

# save names first:
sf_use_s2(TRUE)
names_with_s2 = nc[line,]$NAME
sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
nc[line,]$NAME
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
##  [1] "Macon"      "Sampson"    "Cherokee"   "Cumberland" "Union"     
##  [6] "Anson"      "Hoke"       "Duplin"     "Richmond"   "Clay"      
## [11] "Scotland"
(diff = setdiff(names_with_s2, nc[line,]$NAME))
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## [1] "Jackson"     "Mecklenburg"
plot(st_geometry(nc))
plot(st_geometry(nc)[nc$NAME %in% diff], col = "#00880088", add = TRUE)

ex. 7.3

Plot the two different lines in a single plot; note that R will plot a straight line always straight in the projection currently used; st_segmentize can be used to add points on straight line, or on a great circle for ellipsoidal coordinates.

plot(st_geometry(nc))
plot(st_geometry(nc)[nc$NAME %in% diff], col = "#00880088", add = TRUE)
plot(line, add = TRUE)
plot(st_segmentize(line, units::set_units(10, km)), add = TRUE, col = 'red')

To show that the red line is curved, but only curved in plate carree, and not e.g. in an orthographic projection centered at this region, we can also plot it in an orthographic projection:

l.gc = st_segmentize(line, units::set_units(10, km))
l.pc = st_segmentize(st_set_crs(line, NA), 0.1) %>% st_set_crs(st_crs(l.gc))
o = st_crs("+proj=ortho +lon_0=-80 +lat_0=35")
plot(st_transform(st_geometry(nc), o), axes = TRUE)
plot(st_transform(st_geometry(nc), o)[nc$NAME %in% diff],
     col = "#00880088", add = TRUE)
plot(st_transform(l.gc, o), col = 'red', add = TRUE)
plot(st_transform(l.pc, o), col = 'black', add = TRUE)
plot(st_transform(line, o), col = 'green', add = TRUE)

The fact that the unsegmented line line is straight (R plotted it as straight, it contains only the two endpoints) and that it covers the red line supports that in this plot, the great circle line (red) is plotted straight, and the “straight in plate carree” line is not.

ex. 7.4

NDVI, normalized differenced vegetation index, is computed as (NIR-R)/(NIR+R), with NIR the near infrared and R the red band. Read the L7_ETMs.tif file into object x, and distribute the band dimensions over attributes by split(x, "band"). Then, add attribute NDVI to this object by using an expression that uses the NIR (band 4) and R (band 3) attributes directly.

library(stars)
(x = read_stars(system.file("tif/L7_ETMs.tif", package = "stars")))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##              Min. 1st Qu. Median     Mean 3rd Qu. Max.
## L7_ETMs.tif     1      54     69 68.91242      86  255
## dimension(s):
##      from  to  offset delta                     refsys point values x/y
## x       1 349  288776  28.5 SIRGAS 2000 / UTM zone 25S FALSE   NULL [x]
## y       1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE   NULL [y]
## band    1   6      NA    NA                         NA    NA   NULL
(x.spl = split(x))
## stars object with 2 dimensions and 6 attributes
## attribute(s):
##     Min. 1st Qu. Median     Mean 3rd Qu. Max.
## X1    47      67     78 79.14772      89  255
## X2    32      55     66 67.57465      79  255
## X3    21      49     63 64.35886      77  255
## X4     9      52     63 59.23541      75  255
## X5     1      63     89 83.18266     112  255
## X6     1      32     60 59.97521      88  255
## dimension(s):
##   from  to  offset delta                     refsys point values x/y
## x    1 349  288776  28.5 SIRGAS 2000 / UTM zone 25S FALSE   NULL [x]
## y    1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE   NULL [y]
x.spl$NDVI = (x.spl$X4 - x.spl$X3)/(x.spl$X4 + x.spl$X3)
plot(x.spl["NDVI"])

ex. 7.5

Compute NDVI for the L7_ETMs.tif image by reducing the band dimension, using st_apply and an a function ndvi = function(x) { (x[4]-x[3])/(x[4]+x[3]) }. Plot the result, and write the result to a GeoTIFF.

ndvi_fn = function(x) { (x[4]-x[3])/(x[4]+x[3]) }
ndvi = st_apply(x, 1:2, ndvi_fn)
plot(ndvi)

write_stars(ndvi, "ndvi.tif")

an alternative function is

ndvi_fn = function(x1,x2,x3,x4,x5,x6) { (x4-x3)/(x4+x3) }
ndvi2 = st_apply(x, 1:2, ndvi_fn)
all.equal(ndvi, ndvi2)
## [1] TRUE

This latter function can be much faster, as it is called for chunks of data rather than for individual pixels.

ex. 7.6

Use st_transform to transform the stars object read from L7_ETMs.tif to EPSG:4326. Print the object. Is this a regular grid? Plot the first band using arguments axes=TRUE and explain why this takes such a long time.

(x_t = st_transform(x, 'EPSG:4326'))
plot(x_t[,,,1], axes = TRUE)

the report says it is a curvilinear grid. Plotting takes so long because for curvilinear grids, each cell is converted to a small polygon and then plotted.

ex. 7.7

Use st_warp to warp the L7_ETMs.tif object to EPSG:4326, and plot the resulting object with axes=TRUE. Why is the plot created much faster than after st_transform?

x_w = st_warp(x, crs = 'EPSG:4326')
plot(x_w[,,,1], reset = FALSE)
plot(st_as_sfc(st_bbox(x_w)), col = NA, border = 'red', add = TRUE)

Plotting is faster now because we created a new regular grid. Note that the grid border does not align perfectly with the square formed by the bounding box (using straight lines in an equidistant rectangular projection): white grid cells indicate the misalignment due to warping/transforming.

ex. 7.8

Using a vector representation of the raster L7_ETMs, plot the intersection with a circular area around POINT(293716 9113692) with radius 75 m, and compute the area-weighted mean pixel values for this circle. Compare the area-weighted values with those obtained by aggregate using the vector data, and by aggregate using the raster data, using exact=FALSE (default) and exact=FALSE. Explain the differences.

l7 = st_as_sf(x)
st_agr(l7) = "constant"
a = st_as_sfc("POINT(293716 9113692)", crs = st_crs(l7)) %>%
    st_buffer(units::set_units(74, m))
plot(st_intersection(l7, a))
<<<<<<< HEAD

=======

>>>>>>> 46d70cd7a317d2fd44af3c48309e03675fb52cee
(aw = st_interpolate_aw(l7, a, mean, extensive = FALSE))
## Simple feature collection with 1 feature and 6 fields
## Attribute-geometry relationship: 0 constant, 6 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 293642 ymin: 9113618 xmax: 293790 ymax: 9113766
## Projected CRS: SIRGAS 2000 / UTM zone 25S
##   L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1       88.54362       73.50115       80.69708       49.98497       111.8357
##   L7_ETMs.tif.V6                       geometry
## 1       97.56707 POLYGON ((293790 9113692, 2...
(ag_vector  = aggregate(l7, a, mean))
## Simple feature collection with 1 feature and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 293642 ymin: 9113618 xmax: 293790 ymax: 9113766
## Projected CRS: SIRGAS 2000 / UTM zone 25S
##   L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1        88.0303       72.84848       79.30303       50.33333       111.9697
##   L7_ETMs.tif.V6                       geometry
## 1       96.72727 POLYGON ((293790 9113692, 2...
(ag_rasterF = st_as_sf(aggregate(x, a, mean)))
## Simple feature collection with 1 feature and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 293642 ymin: 9113618 xmax: 293790 ymax: 9113766
## Projected CRS: SIRGAS 2000 / UTM zone 25S
##   L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1           88.6          73.55           81.1          49.55          112.1
##   L7_ETMs.tif.V6                       geometry
## 1           98.4 POLYGON ((293790 9113692, 2...
(ag_rasterT = st_as_sf(aggregate(x, a, mean, exact = TRUE)))
## Simple feature collection with 1 feature and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 293642 ymin: 9113618 xmax: 293790 ymax: 9113766
## Projected CRS: SIRGAS 2000 / UTM zone 25S
##   L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1       88.54362       73.50115       80.69708       49.98497       111.8357
##   L7_ETMs.tif.V6                            sfc
## 1       97.56707 POLYGON ((293790 9113692, 2...
rbind(st_drop_geometry(aw),
      st_drop_geometry(ag_vector), 
      st_drop_geometry(ag_rasterF), 
      st_drop_geometry(ag_rasterT))
##    L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1        88.54362       73.50115       80.69708       49.98497       111.8357
## 11       88.03030       72.84848       79.30303       50.33333       111.9697
## 12       88.60000       73.55000       81.10000       49.55000       112.1000
## 13       88.54362       73.50115       80.69708       49.98497       111.8357
##    L7_ETMs.tif.V6
## 1        97.56707
## 11       96.72727
## 12       98.40000
## 13       97.56707

Area-weighted interpolation computes the area-weighted mean of the areas shown in the plot; aggregate on the vector values computes the unweighted mean over all polygonized pixels that intersect with the circle (black lines); aggregate on the raster values only averages (unweighted) the cells with pixel centers intersecting with the circle (light red):

plot(st_geometry(l7)[a])
plot(a, add = TRUE, col = NA, border = 'red')
plot(st_as_sf(L7_ETMs[a])[1], add = TRUE, col = '#ff000066')
plot(st_as_sf(L7_ETMs[a], as_points = TRUE)[1], add = TRUE, pch = 3, col = 1)

Chapter 8: large datasets

ex 8.1

For the S2 image (above), find out in which order the bands are using st_get_dimension_values(), and try to find out (e.g. by internet search) which spectral bands / colors they correspond to.

f = "sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip"
  granule = system.file(file = f, package = "starsdata")
file.size(granule)
## [1] 769461997
base_name = strsplit(basename(granule), ".zip")[[1]]
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/", base_name, 
    ".SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
library(stars)
(p = read_stars(s2, proxy = TRUE))
## stars_proxy object with 1 attribute in 1 file(s):
## $`MTD_MSIL1C.xml:10m:EPSG_32632`
## [1] "[...]/MTD_MSIL1C.xml:10m:EPSG_32632"
## 
## dimension(s):
##      from    to offset delta                refsys point    values x/y
## x       1 10980  3e+05    10 WGS 84 / UTM zone 32N    NA      NULL [x]
## y       1 10980  6e+06   -10 WGS 84 / UTM zone 32N    NA      NULL [y]
## band    1     4     NA    NA                    NA    NA B4,...,B8
st_get_dimension_values(p, "band")
## [1] "B4" "B3" "B2" "B8"

ex 8.2

Compute NDVI for the S2 image, using st_apply and an an appropriate ndvi function. Plot the result to screen, and then write the result to a GeoTIFF. Explain the difference in runtime between plotting and writing.

ndvi_fn = function(r, g, b, nir) (nir-r)/(nir+r)
ndvi = st_apply(p, 1:2, ndvi_fn)
plot(ndvi)
## downsample set to 18

Alternatively, one could use

ndvi_fn = function(r, g, b, nir) (nir-r)/(nir+r)

but that is much less efficient. Write to a tiff:

system.time(write_stars(ndvi, "ndvi.tif"))
## ================================================================================
##    user  system elapsed 
<<<<<<< HEAD
## 215.845   7.501  58.205
======= ## 98.194 7.495 37.234 >>>>>>> 46d70cd7a317d2fd44af3c48309e03675fb52cee

The runtime difference is caused by the fact that plot downsamples, so computes a very small fraction of the available pixels, where write_stars computes all pixels, and then writes them.

ex 8.3

Plot an RGB composite of the S2 image, using the rgb argument to plot(), and then by using st_rgb() first.

plot(p, rgb = 1:3)
## downsample set to 18

# plot(st_rgb(p[,,,1:3], maxColorValue=13600)) # FIXME: fails

ex 8.4

select five random points from the bounding box of S2, and extract the band values at these points. What is the class of the object returned? Convert the object returned to an sf object.

pts =  p %>% st_bbox() %>% st_as_sfc() %>% st_sample(5)
(p5 = st_extract(p, pts))
## stars object with 2 dimensions and 1 attribute
## attribute(s):
##                                Min. 1st Qu. Median    Mean 3rd Qu. Max.
## MTD_MSIL1C.xml:10m:EPSG_32632     0 1699.25 2209.5 2021.95  2837.5 3648
## dimension(s):
##          from to offset delta                refsys point
## geometry    1  5     NA    NA WGS 84 / UTM zone 32N  TRUE
## band        1  4     NA    NA                    NA    NA
##                                                         values
## geometry POINT (344398.1 5986860),...,POINT (318560.4 5980671)
## band                                                 B4,...,B8
class(p5)
## [1] "stars"
st_as_sf(p5)
## Simple feature collection with 5 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 318560.4 ymin: 5911044 xmax: 407394.7 ymax: 5996965
## Projected CRS: WGS 84 / UTM zone 32N
##     B4   B3   B2   B8                 geometry
## 1 3099 3165 3648 3088 POINT (344398.1 5986860)
## 2    0    0    0    0 POINT (407394.7 5911044)
## 3 2220 2199 2526 1973 POINT (370068.3 5947493)
## 4 1947 1806 2030 1379 POINT (404787.1 5996965)
## 5 2477 2731 3397 2754 POINT (318560.4 5980671)

ex 8.5

For the 10 km radius circle around POINT(390000 5940000), compute the mean pixel values of the S2 image when downsampling the images with factor 30, and on the original resolution. Compute the relative difference between the results.

b = st_buffer(st_sfc(st_point(c(390000, 5940000)), crs = st_crs(p)), 
    units::set_units(10, km))
plot(p[,,,1], reset = FALSE, axes = TRUE)
## downsample set to 18
plot(b, col = NA, border = 'green', add = TRUE)

p1 = st_as_stars(p, downsample = 30)
a1 = aggregate(p1, b, mean)

For the full resolution, this takes a while:

system.time(a2 <- aggregate(p, b, mean))
## Warning in c.stars(structure(list(`MTD_MSIL1C.xml:10m:EPSG_32632` =
## structure(c(1032.86211707414, : along argument ignored; maybe you wanted to use
## st_redimension?
##    user  system elapsed 
<<<<<<< HEAD
##  91.529   1.528  85.147
======= ## 43.843 1.316 41.428 >>>>>>> 46d70cd7a317d2fd44af3c48309e03675fb52cee

Relative differences: we will work on the array of the stars objects:

(a1[[1]] - a2[[1]])/((a1[[1]]+a2[[1]])/2)
##             [,1]         [,2]        [,3]         [,4]
## [1,] 0.001055567 0.0009431265 0.001103254 7.781836e-05

Alternatively one could convert a1 and a2 to a data.frame, using as.data.frame, and work on the third column of the data frames.

ex 8.6

Use hist to compute the histogram on the downsampled S2 image. Also do this for each of the bands. Use ggplot2 to compute a single plot with all four histograms in facets.

hist(p1)

hist(p1[,,,1])

hist(p1[,,,2])

hist(p1[,,,3])

hist(p1[,,,4])

library(ggplot2)
ggplot(as.data.frame(p1), aes(x = MTD_MSIL1C.xml.10m.EPSG_32632)) +
        geom_histogram() + facet_wrap(~band)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ex 8.7

Use st_crop to crop the S2 image to the area covered by the 10 km circle. Plot the results. Explore the effect of setting argument crop = FALSE

plot(st_crop(p, b))
## downsample set to 2

plot(st_crop(p, b, crop = FALSE))
## downsample set to 18

ex 8.8

With the downsampled image, compute the logical layer where all four bands have pixel values higher than 1000. Use a raster algebra expression on the four bands (use split first), or use st_apply for this.

p_spl = split(p1)
p_spl$high = p_spl$B4 > 1000 & p_spl$B3 > 1000 & p_spl$B2 > 1000 & p_spl$B8 > 1000
plot(p_spl["high"])

alternative, using st_apply on the band dimension

p2 = st_apply(p1, 1:2, function(x) all(x > 1000))
plot(p2)

Chapter 12: Spatial Interpolation

ex 12.1

Create a plot like the one in figure 12.13 that has the inverse distance interpolated map of figure 12.2 added on left side.

load the .Rmd of this chapter in rstudio, and run it all the way up to the chunk where figure 12.13 is created (note that this requires that the population density csv file is present).

load("ch12.RData")

The chunk preceding the one that creates the plot can be modified as follows, to add inverse distance interpolations to the kriging and residual kriging:

library(stars)
library(gstat)
kr = krige(NO2~sqrt(pop_dens), no2.sf, grd["pop_dens"], vr.m)
## [using universal kriging]
k$kr1 = k$var1.pred
k$kr2 = kr$var1.pred
i = idw(NO2~1, no2.sf, grd["values"])
## [inverse distance weighted interpolation]
k$kr0 = i$var1.pred
st_redimension(k[c("kr0", "kr1", "kr2")], 
    along = list(what = c("idw", "kriging", "residual kriging"))) %>%
    setNames("NO2") -> km

Next, the plot can be created identically, as the what dimension now contains the inverse distance “layer”:

library(ggplot2)
g + geom_stars(data = km, aes(fill = NO2, x = x, y = y)) + 
    geom_sf(data = st_cast(de, "MULTILINESTRING")) + 
    geom_sf(data = no2.sf) + facet_wrap(~what) +
    coord_sf(lims_method = "geometry_bbox")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

ex 12.2

Create a scatter plot of the map values of the idw and kriging map, and a scatter plot of map values of idw and residual kriging.

For this we can e.g. convert the object \(k\) to a data.frame:

as.data.frame(k) |> head() # NA indicates cells outside AOI
##          x       y var1.pred var1.var kr1 kr2 kr0
## 1 285741.3 6096239        NA       NA  NA  NA  NA
## 2 295741.3 6096239        NA       NA  NA  NA  NA
## 3 305741.3 6096239        NA       NA  NA  NA  NA
## 4 315741.3 6096239        NA       NA  NA  NA  NA
## 5 325741.3 6096239        NA       NA  NA  NA  NA
## 6 335741.3 6096239        NA       NA  NA  NA  NA
k.df = as.data.frame(k)
plot(kr0 ~ kr1, k.df, xlab = "kriging", ylab = "idw")

plot(kr0 ~ kr2, k.df, xlab = "residual kriging", ylab = "idw")

ex 12.3

Carry out a block kriging by setting the block argument in krige(), and do this for block sizes of 10 km (the grid cell size), 50 km and 200 km. Compare the resulting maps of estimates for these three blocks sizes with those obtained by point kriging, and do the same thing for all associated kriging standard errors.

For the point and block kriging values:

b0 =   krige(NO2~1, no2.sf, grd["values"], v.m) # points kriging
## [using ordinary kriging]
b10 =  krige(NO2~1, no2.sf, grd["values"], v.m, block = c(1e4, 1e4))
## [using ordinary kriging]
b50 =  krige(NO2~1, no2.sf, grd["values"], v.m, block = c(5e4, 5e4))
## [using ordinary kriging]
b200 = krige(NO2~1, no2.sf, grd["values"], v.m, block = c(2e5, 2e5))
## [using ordinary kriging]
b10$points = b0$var1.pred
b10$b10 = b10$var1.pred
b10$b50 = b50$var1.pred
b10$b200 = b200$var1.pred
st_redimension(b10[c("points", "b10", "b50", "b200")], 
    along = list(what = c("points", "blocks: 10 km", 
                          "blocks: 50 km", "blocks: 200 km"))) %>%
    setNames("NO2") -> b
g + geom_stars(data = b, aes(fill = NO2, x = x, y = y)) + 
    geom_sf(data = st_cast(de, "MULTILINESTRING")) + 
    geom_sf(data = no2.sf) + facet_wrap(~what) +
    coord_sf(lims_method = "geometry_bbox")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

For the standard errors:

b10$pointse = sqrt(b0$var1.var)
b10$b10se = sqrt(b10$var1.var)
b10$b50se = sqrt(b50$var1.var)
b10$b200se = sqrt(b200$var1.var)
st_redimension(b10[c("pointse", "b10se", "b50se", "b200se")], 
    along = list(what = c("points", "blocks: 10 km", 
                          "blocks: 50 km", "blocks: 200 km"))) %>%
    setNames("NO2_SE") -> b
g + geom_stars(data = b, aes(fill = NO2_SE, x = x, y = y)) + 
    geom_sf(data = st_cast(de, "MULTILINESTRING")) + 
    geom_sf(data = no2.sf) + facet_wrap(~what) +
    coord_sf(lims_method = "geometry_bbox")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

ex 12.4

Based on the residual kriging results obtained above, compute maps of the lower and upper boundary of a 95% confidence interval, when assuming that the kriging error is normally distributed, and show them in a plot with a single (joint) legend

b10$lower = b10$points - 1.96 * b10$pointse
b10$upper = b10$points + 1.96 * b10$pointse
st_redimension(b10[c("lower", "upper")], 
    along = list(what = c("lower boundary C.I.", "upper boundary C.I."))) %>%
    setNames("NO2_95CI") -> b
g + geom_stars(data = b, aes(fill = NO2_95CI, x = x, y = y)) + 
    geom_sf(data = st_cast(de, "MULTILINESTRING")) + 
    geom_sf(data = no2.sf) + facet_wrap(~what) +
    coord_sf(lims_method = "geometry_bbox")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

ex 12.5

Compute and show the map with the probability that NO2 point values exceeds the level of 15 ppm, assuming a normal distribution.

For this we use pnorm(), which gives the cumulative area under the curve from minus infinity up to a given value; one minus that value gives the probability of exceeding it.

b10$p = 1 - pnorm(15, b10$points, b10$pointse)
g + geom_stars(data = b10, aes(fill = p, x = x, y = y)) + 
    geom_sf(data = st_cast(de, "MULTILINESTRING")) + 
    geom_sf(data = no2.sf, col = 'orange') +
    coord_sf(lims_method = "geometry_bbox") +
    guides(fill = guide_legend(title="p(NO2 > 15)"))
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Chapter 14: Proximity and Areal Data

ex 14.1

If dimensionality (point/line/polygon) varies in the data set, geometries must be reduced to the lowest dimension present (usually points). If all the observations are polygonal (polygon or multipolygon), contiguities (shared boundaries) are a sparse and robust neighbour representation (spdep::poly2nb()). Polygons may also be reduced to points by taking for example centroids, but neighbours found by triangulating points may not be the same as contiguity neighbours for the polygons being represented by these centroids (spdep::tri2nb()). If the geometries are multipoint, they must also be reduced to a single point. If the geometries have point rather than areal support, for example real estate transaction data, k-nearest neighbour (spdep::knn2nb(spdep::knearneigh())), graph-based (spdep::graph2nb() applied to the output of spdep::soi.graph(), spdep::relativeneigh() or spdep::gabrielneigh()) and distance-based methods (spdep::dnearneigh()`) may be used.

ex 14.2

Graph-based functions for creating neighbour objects (spdep::tri2nb(), spdep::soi.graph(), spdep::relativeneigh() and spdep::gabrielneigh()) may not be used if the support of the observations is not that of points on the plane. All other functions may be used with both planar and spherical/elliptical geometries, but the neighbours generated may differ if a non-planar data set is treated as planar.

ex 14.3

A chessboard is an \(8 \times 8\) grid:

xy <- data.frame(expand.grid(1:8, 1:8), col=rep(c(rep(c("black", "white"), 4), rep(c("white", "black"), 4)), 4))
library(stars)
library(sf)
(xy %>% st_as_stars() %>% st_as_sf() -> grd)
## Simple feature collection with 64 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0.5 ymin: 0.5 xmax: 8.5 ymax: 8.5
## CRS:           NA
## First 10 features:
##      col                       geometry
## 1  white POLYGON ((0.5 8.5, 1.5 8.5,...
## 2  black POLYGON ((1.5 8.5, 2.5 8.5,...
## 3  white POLYGON ((2.5 8.5, 3.5 8.5,...
## 4  black POLYGON ((3.5 8.5, 4.5 8.5,...
## 5  white POLYGON ((4.5 8.5, 5.5 8.5,...
## 6  black POLYGON ((5.5 8.5, 6.5 8.5,...
## 7  white POLYGON ((6.5 8.5, 7.5 8.5,...
## 8  black POLYGON ((7.5 8.5, 8.5 8.5,...
## 9  black POLYGON ((0.5 7.5, 1.5 7.5,...
## 10 white POLYGON ((1.5 7.5, 2.5 7.5,...
library(spdep)
## Loading required package: sp
## Loading required package: spData
(rook <- poly2nb(grd, queen=FALSE))
## Neighbour list object:
## Number of regions: 64 
## Number of nonzero links: 224 
## Percentage nonzero weights: 5.46875 
## Average number of links: 3.5

The rook neighbours also form a grid, where the neighbours share a grid edge:

plot(st_geometry(grd), col=grd$col) 
plot(rook, xy, add=TRUE, col="grey")

(queen <- poly2nb(grd, queen=TRUE))
## Neighbour list object:
## Number of regions: 64 
## Number of nonzero links: 420 
## Percentage nonzero weights: 10.25391 
## Average number of links: 6.5625

The queen neighbours add neighbours sharing only one corner point:

plot(st_geometry(grd), col=grd$col) 
plot(queen, xy, add=TRUE, col="grey")

and the difference yields neighbours sharing not more than one boundary point:

plot(st_geometry(grd), col=grd$col) 
plot(diffnb(queen, rook), xy, add=TRUE, col="grey")

ex 14.4

We can access cardinalities using card(), and tabulate their frequencies for the chessboard rook case:

((rook %>% card() -> rc) %>% table() -> t)
## .
##  2  3  4 
##  4 24 36

Taking the counts found, we can construct the weights corresponding to those neighbour counts:

1/rev(as.numeric(names(t)))
## [1] 0.2500000 0.3333333 0.5000000

Plotting the row-standardized weights, we see that they up-weight the neighbours of observations with few neighbours, and down-weight the neighbours of observations with more neighbours:

grd$rc <- as.factor(1/rc)
<<<<<<< HEAD
plot(grd[, "rc"], main="rook row-standardized weights", key.width = lcm(2.5))

We can also use the cardinality frequency table to find counts of neighbours with (increasing) weights:

======= plot(grd[, "rc"], main="rook row-standardized weights")

We can also use the cardinality frequency table to find counts of neighbours with (increasing) weights:

>>>>>>> 46d70cd7a317d2fd44af3c48309e03675fb52cee
unname(rev(t))*rev(as.numeric(names(t)))
## [1] 144  72   8

This can be confirmed by tabulating the frequencies of weights yielded by nb2listw():

table(unlist(nb2listw(rook, style="W")$weights))
## 
##              0.25 0.333333333333333               0.5 
##               144                72                 8

Repeating for the queen case again shows how row-standardization can engender edge effects:

((queen %>% card() -> rc) %>% table() -> t)
## .
##  3  5  8 
##  4 24 36
1/rev(as.numeric(names(t)))
## [1] 0.1250000 0.2000000 0.3333333
grd$rc <- as.factor(1/rc)
<<<<<<< HEAD
plot(grd[, "rc"], main = "rook row-standardised weights", key.width = lcm(2.5))

======= plot(grd[, "rc"], main="rook row-standardised weights")

>>>>>>> 46d70cd7a317d2fd44af3c48309e03675fb52cee
unname(rev(t))*rev(as.numeric(names(t)))
## [1] 288 120  12
table(unlist(nb2listw(queen, style="W")$weights))
## 
##             0.125               0.2 0.333333333333333 
##               288               120                12

Chapter 15: Measures of spatial autocorrelation

ex 15.1

Re-using the objects from exercise 14.3, we have:

(grd$col %>% factor() -> COL) %>% table()
## .
## black white 
##    32    32

In the rook case, no black:black or white:white neighbours are found, differing greatly from the expected values, which are based on non-free sampling from the proportions of colours in the data. Highly significant spatial autocorrelation is detected:

(jc_r <- joincount.multi(COL, listw=nb2listw(rook, style="B")))
##             Joincount Expected Variance z-value
## black:black     0.000   27.556    8.325 -9.5503
## white:white     0.000   27.556    8.325 -9.5503
## white:black   112.000   56.889   27.205 10.5662
## Jtot          112.000   56.889   27.205 10.5662

In the queen neighbour case, no spatial autocorrelation is found, despite a chessboard looking spatially structured:

joincount.multi(COL, listw=nb2listw(queen, style="B"))
##             Joincount Expected Variance z-value
## black:black    49.000   51.667   23.616 -0.5487
## white:white    49.000   51.667   23.616 -0.5487
## white:black   112.000  106.667   47.796  0.7714
## Jtot          112.000  106.667   47.796  0.7714

This is because we have chosen to see all eight neighbour grid cells as neighbours (away from the edges of the board), so the two categories occur equally often as neighbour values, as expected.

ex 15.2

First, create an uncorrelated variable, and confirm that it is uncorrelated:

set.seed(1)
x <- rnorm(nrow(grd))
moran.test(x, nb2listw(queen, style="W"), randomisation=FALSE, alternative="two.sided")
## 
##  Moran I test under normality
## 
## data:  x  
## weights: nb2listw(queen, style = "W")    
## 
## Moran I statistic standard deviate = 0.27024, p-value = 0.787
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.002292497      -0.015873016       0.004518556

Next inject patterning into the variable by adding a linear trend:

x_t <- x + (0.15 * xy$Var1)
moran.test(x_t, nb2listw(queen, style="W"), randomisation=FALSE, alternative="two.sided")
## 
##  Moran I test under normality
## 
## data:  x_t  
## weights: nb2listw(queen, style = "W")    
## 
## Moran I statistic standard deviate = 3.0064, p-value = 0.002644
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.186218071      -0.015873016       0.004518556

Test again having taken the residuals from a linear model removing the injected trend:

lm.morantest(lm(x_t ~ xy$Var1), nb2listw(queen, style="W"), alternative="two.sided")
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = x_t ~ xy$Var1)
## weights: nb2listw(queen, style = "W")
## 
## Moran I statistic standard deviate = 0.28036, p-value = 0.7792
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     -0.012449399     -0.030600358      0.004191544

This is important to understand because the spatial patterning in a variable of interest, and picked up by a global measure of spatial autocorrelation, may be driven by an omitted variable. If we cannot add that variable, a latent variable or mixed effects model may be a good choice.

ex 15.3

False discovery rate adjustment is required when conducting repeated tests on the same data set. Usually, local measures of spatial autocorrelation are calculated for all the observations in a data set, and so constitute repeated tests. When repeated tests are conducted, the usual reading of confidence intervals and probability values must be adjusted to take the repeated use of the data into account.

ex 15.4

If we start with the standard local Moran’s \(I_i\) for the random values with a slight 1D trend, upgraded to analytical conditional standard deviates, but with only the standard intercept-only mean model, we have a starting point; a fair number of the values exceed 2:

locm <- localmoran(x_t, nb2listw(queen, style="W"))
plot(density(locm[, 4]))
abline(v=c(-2, 2))

grd$locm_sd <- locm[, 4]
plot(grd[, "locm_sd"]) 

sum(p.adjust(locm[, 5], method="none") < 0.05)
## [1] 6

If we apply false discovery rate adjustment, we have just one significant measure:

sum(p.adjust(locm[, 5], method="fdr") < 0.05)
## [1] 1

In the first Saddlepoint approximation also for the random values with a slight 1D trend, the distribution of standard deviates shifts leftward, with both positive and negative values beyond abs(2):

lm_null <- lm(x_t ~ 1)
locm_null <- summary(localmoran.sad(lm_null, nb=queen, style="W"))
plot(density(locm_null[, "Saddlepoint"]))
abline(v=c(-2, 2))

grd$locm_null_sd <- locm_null[, "Saddlepoint"]
plot(grd[, "locm_null_sd"]) 

sum(p.adjust(locm_null[, "Pr. (Sad)"], method="none") < 0.05)
## [1] 8

If we apply false discovery rate adjustment, we also have just one significant measure:

sum(p.adjust(locm_null[, "Pr. (Sad)"], method="fdr") < 0.05)
## [1] 1

Once we analyse a model including the 1D trend, most of the distribution of standard deviate values is between -2 and 2:

lm_trend <- lm(x_t ~ xy$Var1)
locm_tr <- summary(localmoran.sad(lm_trend, nb=queen, style="W"))
plot(density(locm_tr[, "Saddlepoint"]))
abline(v=c(-2, 2))

grd$locm_tr_sd <- locm_tr[, "Saddlepoint"]
plot(grd[, "locm_tr_sd"]) 

sum(p.adjust(locm_tr[, "Pr. (Sad)"], method="none") < 0.05)
## [1] 2

If we apply false discovery rate adjustment, we now have no significant measures, as expected:

sum(p.adjust(locm_tr[, "Pr. (Sad)"], method="fdr") < 0.05)
## [1] 0

localmoran.sad() or localmoran.exact() provide both richer mean models, and estimates of the standard deviates built on the underlying spatial relationships for each observation, rather than analytical or permutation assumptions for the whole data set. This is achieved at the cost of longer compute times and larger memory use, especially when the Omega= argument to localmoran.sad() or localmoran.exact.alt() is used, because this is a dense \(n \times n\) matrix.

Chapter 16: Spatial Regression

ex 16.1

The HSAR package includes an upper level polygon support municipality department data set, ans a lower level property data set. Both are "sf" objects, in the same projected CRS.

library(sf)
library(HSAR)
data(depmunic)
data(properties)
depmunic$popdens <- depmunic$population/ (10000*depmunic$area)
depmunic$foreigners <- 100 * depmunic$pop_rest/ depmunic$population
depmunic$prgreensp <- depmunic$greensp/ (10000*depmunic$area)

In the vignette, two upper-level variables are added to the six already present, and we change the green space variable scaling to avoid numerical issues in calculating coefficient standard errors.

summary(depmunic)
##     num_dep        airbnb          museums         population    
##  Min.   :1.0   Min.   : 144.0   Min.   : 0.000   Min.   : 45168  
##  1st Qu.:2.5   1st Qu.: 377.5   1st Qu.: 0.000   1st Qu.: 78753  
##  Median :4.0   Median : 565.0   Median : 0.000   Median : 98283  
##  Mean   :4.0   Mean   : 712.3   Mean   : 2.571   Mean   : 93702  
##  3rd Qu.:5.5   3rd Qu.: 675.5   3rd Qu.: 0.500   3rd Qu.:112677  
##  Max.   :7.0   Max.   :2171.0   Max.   :17.000   Max.   :129603  
##     pop_rest        greensp            area                geometry
##  Min.   : 2735   Min.   : 40656   Min.   :3.987   POLYGON      :7  
##  1st Qu.: 4588   1st Qu.: 42340   1st Qu.:4.264   epsg:2100    :0  
##  Median : 5099   Median : 93715   Median :4.836   +proj=tmer...:0  
##  Mean   : 7109   Mean   :183796   Mean   :5.420                    
##  3rd Qu.: 8110   3rd Qu.:294286   3rd Qu.:6.412                    
##  Max.   :16531   Max.   :478951   Max.   :7.764                    
##     popdens         foreigners       prgreensp     
##  Min.   :0.8039   Min.   : 4.890   Min.   :0.7708  
##  1st Qu.:1.2979   1st Qu.: 5.058   1st Qu.:0.9653  
##  Median :1.8763   Median : 6.055   Median :1.2070  
##  Mean   :1.8697   Mean   : 7.369   Mean   :3.3883  
##  3rd Qu.:2.2807   3rd Qu.: 8.882   3rd Qu.:4.9527  
##  Max.   :3.2508   Max.   :12.755   Max.   :9.9040

The properties data set has only four variables, but with price per square metre already added:

summary(properties)
##       id                 size             price             prpsqm      
##  Length:1000        Min.   :  21.00   Min.   :   8000   Min.   : 207.5  
##  Class :character   1st Qu.:  55.00   1st Qu.:  44750   1st Qu.: 631.1  
##  Mode  :character   Median :  75.00   Median :  80000   Median :1098.8  
##                     Mean   :  83.17   Mean   : 123677   Mean   :1316.9  
##                     3rd Qu.:  98.00   3rd Qu.: 147000   3rd Qu.:1814.2  
##                     Max.   :1250.00   Max.   :5500000   Max.   :9166.7  
##       age          dist_metro                geometry   
##  Min.   : 0.00   Min.   :   4.423   POINT        :1000  
##  1st Qu.:11.00   1st Qu.: 338.523   epsg:2100    :   0  
##  Median :42.00   Median : 537.250   +proj=tmer...:   0  
##  Mean   :34.16   Mean   : 610.772                       
##  3rd Qu.:48.00   3rd Qu.: 819.929                       
##  Max.   :67.00   Max.   :1888.856

The values of the variables in depmunic get copied to each of the properties falling within the boundaries of the municipality departments:

properties_in_dd <- st_join(properties, depmunic, join = st_within)

ex 16.2

For polygon support, we prefer contiguous neighbours:

(mun_nb <- spdep::poly2nb(depmunic, row.names=as.character(depmunic$num_dep)))
## Neighbour list object:
## Number of regions: 7 
## Number of nonzero links: 20 
## Percentage nonzero weights: 40.81633 
## Average number of links: 2.857143

Global spatial autocorrelation is marginally detected for the green space variable:

spdep::moran.test(depmunic$prgreensp, nb2listw(mun_nb))
## 
##  Moran I test under randomisation
## 
## data:  depmunic$prgreensp  
## weights: nb2listw(mun_nb)    
## 
## Moran I statistic standard deviate = 1.825, p-value = 0.034
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.22384238       -0.16666667        0.04578844

Unlike the vignette, which uses distance neighbours up to 1300 m and creates a very dense representation, we choose k=4 k-nearest neighbours, then convert to symmetry (note that some point locations are duplicated, preventing the use of spatial indexing):

(pr_nb_k4s <- spdep::knn2nb(spdep::knearneigh(properties, k=4), sym=TRUE, row.names=properties$id))
## Warning in spdep::knearneigh(properties, k = 4): knearneigh: identical points
## found
## Warning in spdep::knearneigh(properties, k = 4): knearneigh: kd_tree not
## available for identical points
## Neighbour list object:
## Number of regions: 1000 
## Number of nonzero links: 5874 
## Percentage nonzero weights: 0.5874 
## Average number of links: 5.874

Copying out has led to the introduction of very powerful positive spatial autocorrelation in this and other variables copied out:

spdep::moran.test(properties_in_dd$prgreensp, nb2listw(pr_nb_k4s))
## 
##  Moran I test under randomisation
## 
## data:  properties_in_dd$prgreensp  
## weights: nb2listw(pr_nb_k4s)    
## 
## Moran I statistic standard deviate = 51.666, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.9757099996     -0.0010010010      0.0003573715

ex 16.3

The vignette proposes the full property level and municipal department level set of variables straight away. Here we choose the proerty level ones first, and update for the copied out municipal department level ones next:

f_pr <- prpsqm ~ size + age + dist_metro
f_pr_md <- update(f_pr, . ~ . + foreigners + prgreensp + popdens + museums + airbnb)

Adding in the copied out upper level variables appears to account for more of the variability of the response than leaving them out:

library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
pr_base <- gam(f_pr, data=properties_in_dd)
pr_2lev <- gam(f_pr_md, data=properties_in_dd)
anova(pr_base, pr_2lev, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens + 
##     museums + airbnb
##   Resid. Df Resid. Dev Df  Deviance  Pr(>Chi)    
## 1       996  625149227                           
## 2       991  524571512  5 100577716 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(pr_base)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## prpsqm ~ size + age + dist_metro
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1702.61693   78.59752   21.66  < 2e-16 ***
## size           5.18681    0.46510   11.15  < 2e-16 ***
## age          -18.11938    1.33737  -13.55  < 2e-16 ***
## dist_metro    -0.32446    0.07115   -4.56 5.75e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.234   Deviance explained = 23.7%
## GCV = 6.3018e+05  Scale est. = 6.2766e+05  n = 1000
summary(pr_2lev)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens + 
##     museums + airbnb
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1648.17176  189.84951   8.681   <2e-16 ***
## size           4.29551    0.43386   9.901   <2e-16 ***
## age          -20.18585    1.27009 -15.893   <2e-16 ***
## dist_metro    -0.15180    0.07159  -2.120   0.0342 *  
## foreigners   -38.13473   22.54311  -1.692   0.0910 .  
## prgreensp     23.90080   16.33291   1.463   0.1437    
## popdens      -51.82590  103.65578  -0.500   0.6172    
## museums      -19.06125   21.27904  -0.896   0.3706    
## airbnb         0.63284    0.32887   1.924   0.0546 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.354   Deviance explained =   36%
## GCV = 5.3414e+05  Scale est. = 5.2934e+05  n = 1000

Adding an upper level IID random effect to the base formula also improves the fit of the model substantially:

pr_base_iid <- gam(update(f_pr, . ~ . + s(num_dep, bs="re")), data=properties_in_dd)
anova(pr_base, pr_base_iid, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro + s(num_dep, bs = "re")
##   Resid. Df Resid. Dev      Df Deviance  Pr(>Chi)    
## 1       996  625149227                               
## 2       995  557913777 0.99993 67235450 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(pr_base_iid)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## prpsqm ~ size + age + dist_metro + s(num_dep, bs = "re")
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2244.07579   89.35422  25.114  < 2e-16 ***
## size           4.63594    0.44249  10.477  < 2e-16 ***
## age          -19.12405    1.26739 -15.089  < 2e-16 ***
## dist_metro    -0.18383    0.06848  -2.685  0.00738 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df     F p-value    
## s(num_dep) 0.9916      1 118.9  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.316   Deviance explained = 31.9%
## GCV = 5.6353e+05  Scale est. = 5.6071e+05  n = 1000

This improvement is much more moderate when both the upper level variables and IID random effect are present:

pr_2lev_iid <- gam(update(f_pr_md, . ~ . + s(num_dep, bs="re")), data=properties_in_dd)
anova(pr_2lev, pr_2lev_iid, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens + 
##     museums + airbnb
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens + 
##     museums + airbnb + s(num_dep, bs = "re")
##   Resid. Df Resid. Dev      Df Deviance Pr(>Chi)  
## 1    991.00  524571512                            
## 2    990.04  522144057 0.95676  2427454  0.02981 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(pr_2lev_iid)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens + 
##     museums + airbnb + s(num_dep, bs = "re")
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1520.74633  200.41291   7.588 7.48e-14 ***
## size           4.29515    0.43303   9.919  < 2e-16 ***
## age          -20.09458    1.26851 -15.841  < 2e-16 ***
## dist_metro    -0.14574    0.07152  -2.038  0.04184 *  
## foreigners   -83.04410   32.17854  -2.581  0.01000 *  
## prgreensp    -68.29107   49.95934  -1.367  0.17196    
## popdens      261.72640  191.05195   1.370  0.17102    
## museums     -125.97470   58.73993  -2.145  0.03223 *  
## airbnb         1.92092    0.73695   2.607  0.00928 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df     F p-value  
## s(num_dep) 0.7921      1 3.811  0.0285 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.357   Deviance explained = 36.3%
## GCV = 5.3252e+05  Scale est. = 5.2731e+05  n = 1000

ex 16.4

The "mrf" smooth term needs ID keys set so that the neighbour object is correctly matched to the observations. Once these are provided, the properties level model with a municipality department level MRF smooth may be fit:

names(mun_nb) <- attr(mun_nb, "region.id")
properties_in_dd$num_dep <- factor(properties_in_dd$num_dep)
pr_base_mrf <- gam(update(f_pr, . ~ . + s(num_dep, bs="mrf", xt=list(nb=mun_nb))),
    data=properties_in_dd)
summary(pr_base_mrf)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## prpsqm ~ size + age + dist_metro + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1730.81146   76.44218  22.642   <2e-16 ***
## size           4.32010    0.43280   9.982   <2e-16 ***
## age          -20.00432    1.26761 -15.781   <2e-16 ***
## dist_metro    -0.14718    0.07142  -2.061   0.0396 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df    F p-value    
## s(num_dep) 5.852      6 31.9  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.357   Deviance explained = 36.3%
## GCV = 5.3255e+05  Scale est. = 5.2731e+05  n = 1000

Repeating for the extended model with upper level variables present, we see that no more response variability is accounted for than in the lower level variables only MRF RE model, and none of the upper level variables are significant at conventional levels.

pr_2lev_mrf <- gam(update(f_pr_md, . ~ . + s(num_dep, bs="mrf", xt=list(nb=mun_nb))),
    data=properties_in_dd)
summary(pr_2lev_mrf)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens + 
##     museums + airbnb + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1510.58909  425.16218   3.553 0.000399 ***
## size           4.29515    0.43303   9.919  < 2e-16 ***
## age          -20.09456    1.26851 -15.841  < 2e-16 ***
## dist_metro    -0.14574    0.07152  -2.038 0.041844 *  
## foreigners   -42.18041   58.30966  -0.723 0.469613    
## prgreensp     15.34415   54.84749   0.280 0.779720    
## popdens      -50.09688  259.59932  -0.193 0.847016    
## museums      -53.40432   55.25567  -0.966 0.334033    
## airbnb         1.01045    0.81453   1.241 0.215072    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df     F p-value  
## s(num_dep) 0.7922      2 1.906  0.0285 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.357   Deviance explained = 36.3%
## GCV = 5.3252e+05  Scale est. = 5.2731e+05  n = 1000

It also seems that the model without upper level variables outperforms that with them included:

anova(pr_base_mrf, pr_2lev_mrf, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: prpsqm ~ size + age + dist_metro + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens + 
##     museums + airbnb + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
##   Resid. Df Resid. Dev        Df Deviance Pr(>Chi)  
## 1    990.01  522113016                              
## 2    990.04  522143950 -0.037064   -30934    0.054 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and the MRF RE outperforms the IID RE:

anova(pr_base_mrf, pr_base_iid, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: prpsqm ~ size + age + dist_metro + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
## Model 2: prpsqm ~ size + age + dist_metro + s(num_dep, bs = "re")
##   Resid. Df Resid. Dev      Df  Deviance  Pr(>Chi)    
## 1    990.01  522113016                                
## 2    995.00  557913777 -4.9939 -35800762 2.786e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Chapter 17: Spatial econometrics models

ex 17.1

First create a spatial weights object from the k=4 symmetrized neighbour object:

library(spatialreg)
## Loading required package: Matrix
## 
## Attaching package: 'spatialreg'
## The following objects are masked from 'package:spdep':
## 
##     get.ClusterOption, get.coresOption, get.mcOption,
##     get.VerboseOption, get.ZeroPolicyOption, set.ClusterOption,
##     set.coresOption, set.mcOption, set.VerboseOption,
##     set.ZeroPolicyOption
lw <- nb2listw(pr_nb_k4s)

Fit a linear model to the lower-level data; all the included variables seem worth retaining:

LM_pr <- lm(f_pr, data=properties_in_dd)
summary(LM_pr)
## 
## Call:
## lm(formula = f_pr, data = properties_in_dd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5691.2  -456.7  -202.2   251.9  6872.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1702.61693   78.59752   21.66  < 2e-16 ***
## size           5.18681    0.46510   11.15  < 2e-16 ***
## age          -18.11938    1.33737  -13.55  < 2e-16 ***
## dist_metro    -0.32446    0.07115   -4.56 5.75e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 792.2 on 996 degrees of freedom
## Multiple R-squared:  0.2368, Adjusted R-squared:  0.2345 
## F-statistic:   103 on 3 and 996 DF,  p-value: < 2.2e-16

However, there is strong residual autocorrelation:

lm.morantest(LM_pr, lw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = f_pr, data = properties_in_dd)
## weights: lw
## 
## Moran I statistic standard deviate = 26.039, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.4872794374    -0.0023026155     0.0003535235

Robust Lagrange multiplier tests suggest that the fitted model should include a spatial autoregressive process in the residuals, but not in the response:

lm.LMtests(LM_pr, lw, test=c("RLMerr", "RLMlag"))
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = f_pr, data = properties_in_dd)
## weights: lw
## 
## RLMerr = 150.88, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = f_pr, data = properties_in_dd)
## weights: lw
## 
## RLMlag = 1.2688, df = 1, p-value = 0.26

Adding in the copied out municipality department level variables, we see that they do not seem to be worth retaining (unless there are good reasons for doing so); they do however improve model fit:

LM_pr_md <- lm(f_pr_md, data=properties_in_dd)
summary(LM_pr_md)
## 
## Call:
## lm(formula = f_pr_md, data = properties_in_dd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5175.9  -371.6  -107.1   266.7  6648.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1648.17176  189.84951   8.681   <2e-16 ***
## size           4.29551    0.43386   9.901   <2e-16 ***
## age          -20.18585    1.27009 -15.893   <2e-16 ***
## dist_metro    -0.15180    0.07159  -2.120   0.0342 *  
## foreigners   -38.13473   22.54311  -1.692   0.0910 .  
## prgreensp     23.90080   16.33291   1.463   0.1437    
## popdens      -51.82590  103.65578  -0.500   0.6172    
## museums      -19.06125   21.27904  -0.896   0.3706    
## airbnb         0.63284    0.32887   1.924   0.0546 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 727.6 on 991 degrees of freedom
## Multiple R-squared:  0.3596, Adjusted R-squared:  0.3544 
## F-statistic: 69.55 on 8 and 991 DF,  p-value: < 2.2e-16

The pre-test results are similar to those for the properties-only variables:

lm.morantest(LM_pr_md, lw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = f_pr_md, data = properties_in_dd)
## weights: lw
## 
## Moran I statistic standard deviate = 23.707, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.4307327715    -0.0070398073     0.0003410046

and the LM tests continue to indicate an omitted spatial process in the residual rather than the response:

lm.LMtests(LM_pr_md, lw, test=c("RLMerr", "RLMlag"))
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = f_pr_md, data = properties_in_dd)
## weights: lw
## 
## RLMerr = 116.67, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = f_pr_md, data = properties_in_dd)
## weights: lw
## 
## RLMlag = 0.0035207, df = 1, p-value = 0.9527

ex 17.2

We may update the formula for the properties-only model to include municipality department “fixed effects”, dummy variables:

LM_pr_fx <- lm(update(f_pr, . ~ . + num_dep), data=properties_in_dd)
summary(LM_pr_fx)
## 
## Call:
## lm(formula = update(f_pr, . ~ . + num_dep), data = properties_in_dd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5175.3  -377.2   -97.9   272.7  6644.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.352e+03  9.816e+01  23.964  < 2e-16 ***
## size         4.295e+00  4.330e-01   9.919  < 2e-16 ***
## age         -2.007e+01  1.269e+00 -15.819  < 2e-16 ***
## dist_metro  -1.442e-01  7.154e-02  -2.015 0.044176 *  
## num_dep2    -3.342e+02  8.695e+01  -3.844 0.000129 ***
## num_dep3    -4.896e+02  1.290e+02  -3.794 0.000157 ***
## num_dep4    -1.031e+03  1.193e+02  -8.641  < 2e-16 ***
## num_dep5    -8.222e+02  8.251e+01  -9.964  < 2e-16 ***
## num_dep6    -9.183e+02  7.861e+01 -11.681  < 2e-16 ***
## num_dep7    -6.527e+02  8.250e+01  -7.911 6.78e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 726.2 on 990 degrees of freedom
## Multiple R-squared:  0.3627, Adjusted R-squared:  0.3569 
## F-statistic:  62.6 on 9 and 990 DF,  p-value: < 2.2e-16

The pre-test output is similar to that for the models considered above:

lm.morantest(LM_pr_fx, lw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = update(f_pr, . ~ . + num_dep), data =
## properties_in_dd)
## weights: lw
## 
## Moran I statistic standard deviate = 23.611, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.4268054149    -0.0079765936     0.0003390813
lm.LMtests(LM_pr_fx, lw, test=c("RLMerr", "RLMlag"))
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = update(f_pr, . ~ . + num_dep), data =
## properties_in_dd)
## weights: lw
## 
## RLMerr = 117.99, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = update(f_pr, . ~ . + num_dep), data =
## properties_in_dd)
## weights: lw
## 
## RLMlag = 0.030669, df = 1, p-value = 0.861

We may fit a regimes model, where separate regression coefficients are calculated for interactions between the municipality department dummies and the included variables; size and dist_metro only retian influence for municipality departments 1 and 2:

LM_pr_reg <- lm(update(f_pr, . ~ num_dep/(0 + .)), data=properties_in_dd)
summary(LM_pr_reg)
## 
## Call:
## lm(formula = update(f_pr, . ~ num_dep/(0 + .)), data = properties_in_dd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2055.0  -328.6   -62.8   257.2  6353.5 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## num_dep1            2366.55794  206.45952  11.463  < 2e-16 ***
## num_dep2            1114.08697  167.84304   6.638 5.29e-11 ***
## num_dep3            2095.07546  424.99044   4.930 9.68e-07 ***
## num_dep4            1351.29266  465.10378   2.905 0.003752 ** 
## num_dep5            1695.64352  234.29757   7.237 9.30e-13 ***
## num_dep6            1611.35873  167.16126   9.640  < 2e-16 ***
## num_dep7            1705.22449  176.18849   9.678  < 2e-16 ***
## num_dep1:size          1.36164    0.51169   2.661 0.007918 ** 
## num_dep2:size         14.62370    0.99848  14.646  < 2e-16 ***
## num_dep3:size          2.25680    4.50973   0.500 0.616886    
## num_dep4:size          3.79360    4.56205   0.832 0.405864    
## num_dep5:size          3.06546    1.93716   1.582 0.113872    
## num_dep6:size          1.49313    1.33800   1.116 0.264724    
## num_dep7:size          5.91935    1.37694   4.299 1.89e-05 ***
## num_dep1:age          -6.83169    3.55084  -1.924 0.054651 .  
## num_dep2:age          -9.36978    3.07762  -3.044 0.002393 ** 
## num_dep3:age         -18.07526    5.65924  -3.194 0.001449 ** 
## num_dep4:age         -22.43736    5.03824  -4.453 9.43e-06 ***
## num_dep5:age         -27.07143    2.71096  -9.986  < 2e-16 ***
## num_dep6:age         -23.33611    2.34894  -9.935  < 2e-16 ***
## num_dep7:age         -24.34731    2.65955  -9.155  < 2e-16 ***
## num_dep1:dist_metro   -0.72509    0.21429  -3.384 0.000744 ***
## num_dep2:dist_metro   -0.61219    0.17457  -3.507 0.000474 ***
## num_dep3:dist_metro   -0.56261    0.52693  -1.068 0.285918    
## num_dep4:dist_metro    0.02032    0.43294   0.047 0.962570    
## num_dep5:dist_metro    0.10681    0.29026   0.368 0.712965    
## num_dep6:dist_metro    0.05074    0.09829   0.516 0.605810    
## num_dep7:dist_metro   -0.14110    0.14727  -0.958 0.338274    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 663.8 on 972 degrees of freedom
## Multiple R-squared:  0.8323, Adjusted R-squared:  0.8274 
## F-statistic: 172.2 on 28 and 972 DF,  p-value: < 2.2e-16

The pre-test results are now changed, with possible spatial processes in both residuals and response being indicated:

lm.morantest(LM_pr_reg, lw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = update(f_pr, . ~ num_dep/(0 + .)), data =
## properties_in_dd)
## weights: lw
## 
## Moran I statistic standard deviate = 18.521, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.3198250101    -0.0139390950     0.0003247394
lm.LMtests(LM_pr_reg, lw, test=c("RLMerr", "RLMlag"))
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = update(f_pr, . ~ num_dep/(0 + .)), data =
## properties_in_dd)
## weights: lw
## 
## RLMerr = 40.873, df = 1, p-value = 1.625e-10
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = update(f_pr, . ~ num_dep/(0 + .)), data =
## properties_in_dd)
## weights: lw
## 
## RLMlag = 19.055, df = 1, p-value = 1.27e-05

ex 17.3

Fitting models initially by maximum likelihood (GMM may also be used), we pre-compute the eigenvalues:

eigs <- eigenw(lw)

The strong residual autocorrelation is picked up by the spatial coefficient, but unfortunately the Hausman test shows strong mis-specification:

SEM_pr <- errorsarlm(f_pr, data=properties_in_dd, listw=lw, Durbin=FALSE,
    control=list(pre_eig=eigs))
summary(SEM_pr, Hausman=TRUE)
## 
## Call:errorsarlm(formula = f_pr, data = properties_in_dd, listw = lw, 
##     Durbin = FALSE, control = list(pre_eig = eigs))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3214.224  -337.719   -75.561   196.920  5411.793 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error  z value Pr(>|z|)
## (Intercept) 1999.09195  120.70305  16.5621  < 2e-16
## size           3.21092    0.35840   8.9591  < 2e-16
## age          -23.64759    1.06426 -22.2199  < 2e-16
## dist_metro    -0.27302    0.14694  -1.8581  0.06315
## 
## Lambda: 0.6927, LR test value: 459.76, p-value: < 2.22e-16
## Asymptotic standard error: 0.024179
##     z-value: 28.649, p-value: < 2.22e-16
## Wald statistic: 820.74, p-value: < 2.22e-16
## 
## Log likelihood: -7861.931 for error model
## ML residual variance (sigma squared): 354240, (sigma: 595.18)
## Number of observations: 1000 
## Number of parameters estimated: 6 
## AIC: 15736, (AIC for lm: 16194)
## Hausman test: -246.29, df: 4, p-value: < 2.22e-16

The Hausman test compares the OLS and SEM coefficient estimates and their standard errors, assessing whether their distributions overlap sufficiently to suggest the absence of major mis-specification:

(LM_coefs <- coef(summary(LM_pr)))
##                 Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept) 1702.6169273 78.59752134  21.662476 1.432984e-85
## size           5.1868075  0.46509907  11.152049 2.679763e-27
## age          -18.1193826  1.33737467 -13.548472 1.662323e-38
## dist_metro    -0.3244594  0.07115232  -4.560068 5.749564e-06
(SEM_coefs <- coef(summary(SEM_pr)))
##                 Estimate  Std. Error    z value   Pr(>|z|)
## (Intercept) 1999.0919466 120.7030525  16.562066 0.00000000
## size           3.2109208   0.3583989   8.959070 0.00000000
## age          -23.6475917   1.0642550 -22.219854 0.00000000
## dist_metro    -0.2730234   0.1469362  -1.858108 0.06315365

The tables are harder to read than the figure, which shows that the coefficient estimates do differ a lot for two variables, somewhat for the intercept, and little for one variable, but where the ML standard error estimate under usual assumptions crosses zero:

opar <- par(mfrow=c(2,2))
plot(1, type="n", xlim=c(1400, 2500), ylim=c(0, 0.006), xlab=rownames(LM_coefs)[1], ylab="")
curve(dnorm(x, mean=LM_coefs[1,1], sd=LM_coefs[1,2]), add=TRUE)
abline(v=LM_coefs[1,1])
abline(v=SEM_coefs[1,1], lwd=2, col="orange")
curve(dnorm(x, mean=SEM_coefs[1,1], sd=SEM_coefs[1,2]), add=TRUE, col="orange", lwd=2)
legend("topright", legend=c("LM", "SEM"), col=c("black", "orange"), lwd=1:2, bty="n")
plot(1, type="n", xlim=c(1.5, 7), ylim=c(0, 1.1), xlab=rownames(LM_coefs)[2], ylab="")
curve(dnorm(x, mean=LM_coefs[2,1], sd=LM_coefs[2,2]), add=TRUE)
abline(v=LM_coefs[2,1])
abline(v=SEM_coefs[2,1], lwd=2, col="orange")
curve(dnorm(x, mean=SEM_coefs[2,1], sd=SEM_coefs[2,2]), add=TRUE, col="orange", lwd=2)
plot(1, type="n", xlim=c(-28, -13), ylim=c(0, 0.4), xlab=rownames(LM_coefs)[3], ylab="")
curve(dnorm(x, mean=LM_coefs[3,1], sd=LM_coefs[3,2]), add=TRUE)
abline(v=LM_coefs[3,1])
abline(v=SEM_coefs[3,1], lwd=2, col="orange")
curve(dnorm(x, mean=SEM_coefs[3,1], sd=SEM_coefs[3,2]), add=TRUE, col="orange", lwd=2)
plot(1, type="n", xlim=c(-0.9, 0.3), ylim=c(0, 6), xlab=rownames(LM_coefs)[4], ylab="")
curve(dnorm(x, mean=LM_coefs[4,1], sd=LM_coefs[4,2]), add=TRUE)
abline(v=LM_coefs[4,1])
abline(v=SEM_coefs[4,1], lwd=2, col="orange")
curve(dnorm(x, mean=SEM_coefs[4,1], sd=SEM_coefs[4,2]), add=TRUE, col="orange", lwd=2)

par(opar)

The Hausman test also suggests mis-specification for the SEM model augmented with the municipality department level variables:

SEM_pr_md <- errorsarlm(f_pr_md, data=properties_in_dd, listw=lw, Durbin=FALSE,
    control=list(pre_eig=eigs))
summary(SEM_pr_md, Hausman=TRUE)
## 
## Call:errorsarlm(formula = f_pr_md, data = properties_in_dd, listw = lw, 
##     Durbin = FALSE, control = list(pre_eig = eigs))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3421.923  -320.793   -65.301   224.785  5452.101 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error  z value  Pr(>|z|)
## (Intercept) 1945.79330  333.00265   5.8432 5.121e-09
## size           3.05839    0.35711   8.5643 < 2.2e-16
## age          -23.45168    1.06301 -22.0616 < 2.2e-16
## dist_metro    -0.14969    0.13214  -1.1328    0.2573
## foreigners    -7.66310   42.72513  -0.1794    0.8577
## prgreensp     38.40538   31.36427   1.2245    0.2208
## popdens     -211.52190  190.49687  -1.1104    0.2668
## museums      -33.84713   39.33249  -0.8605    0.3895
## airbnb         0.59183    0.60062   0.9854    0.3244
## 
## Lambda: 0.62414, LR test value: 335.94, p-value: < 2.22e-16
## Asymptotic standard error: 0.028289
##     z-value: 22.063, p-value: < 2.22e-16
## Wald statistic: 486.78, p-value: < 2.22e-16
## 
## Log likelihood: -7836.138 for error model
## ML residual variance (sigma squared): 345330, (sigma: 587.65)
## Number of observations: 1000 
## Number of parameters estimated: 11 
## AIC: 15694, (AIC for lm: 16028)
## Hausman test: -1348.4, df: 9, p-value: < 2.22e-16

Extending to the SDEM models, and reporting impacts:

SDEM_pr <- errorsarlm(f_pr, data=properties_in_dd, listw=lw, Durbin=TRUE,
    control=list(pre_eig=eigs))
summary(impacts(SDEM_pr), short=TRUE, zstats=TRUE)
## Impact measures (SDEM, estimable, n):
##                 Direct   Indirect       Total
## size         3.4445402  2.0126058   5.4571460
## age        -21.9222898 10.9931411 -10.9291487
## dist_metro   0.2000411 -0.5494199  -0.3493789
## ========================================================
## Standard errors:
##               Direct  Indirect     Total
## size       0.3983375 1.0053504 1.2310345
## age        1.1425638 2.5191829 3.1162005
## dist_metro 0.2734082 0.3104821 0.1551602
## ========================================================
## Z-values:
##                Direct  Indirect     Total
## size         8.647291  2.001895  4.432976
## age        -19.186928  4.363772 -3.507203
## dist_metro   0.731657 -1.769570 -2.251730
## 
## p-values:
##            Direct  Indirect   Total     
## size       < 2e-16 0.045296   9.2941e-06
## age        < 2e-16 1.2784e-05 0.00045284
## dist_metro 0.46438 0.076799   0.02433931

we have Hausman test results still indicating strong mis-specification:

Hausman.test(SDEM_pr)
## 
##  Spatial Hausman test (asymptotic)
## 
## data:  NULL
## Hausman test = 53.765, df = 7, p-value = 2.618e-09

The same applies to the properties variables augmented with the municipality department level variables:

SDEM_pr_md <- errorsarlm(f_pr_md, data=properties_in_dd, listw=lw, Durbin=TRUE,
    control=list(pre_eig=eigs))
summary(impacts(SDEM_pr_md), short=TRUE, zstats=TRUE)
## Impact measures (SDEM, estimable, n):
##                  Direct     Indirect       Total
## size          3.2902831    1.6915614   4.9818445
## age         -22.5877018    7.0789016 -15.5088002
## dist_metro    0.2706259   -0.4643729  -0.1937470
## foreigners  107.6194776 -146.8457500 -39.2262724
## prgreensp    35.3436543  -12.5353752  22.8082791
## popdens    -456.7261794  400.0782288 -56.6479506
## museums      51.9819779  -93.3777533 -41.3957754
## airbnb       -0.8658815    1.7162902   0.8504087
## ========================================================
## Standard errors:
##                 Direct    Indirect       Total
## size         0.3900357   0.9742177   1.1837952
## age          1.1250931   2.5043048   3.0554700
## dist_metro   0.2747535   0.3097415   0.1459340
## foreigners  87.2520039  94.9604414  45.6522660
## prgreensp   74.3147381  79.3194509  33.1282709
## popdens    403.5523890 439.7109664 210.7386098
## museums     85.7152273  93.9719231  43.7578240
## airbnb       1.1452242   1.2862740   0.6730806
## ========================================================
## Z-values:
##                 Direct   Indirect      Total
## size         8.4358510  1.7363279  4.2083670
## age        -20.0762964  2.8266933 -5.0757495
## dist_metro   0.9849770 -1.4992275 -1.3276345
## foreigners   1.2334327 -1.5463887 -0.8592404
## prgreensp    0.4755941 -0.1580366  0.6884838
## popdens     -1.1317643  0.9098664 -0.2688067
## museums      0.6064497 -0.9936772 -0.9460200
## airbnb      -0.7560803  1.3343115  1.2634575
## 
## p-values:
##            Direct  Indirect  Total     
## size       < 2e-16 0.0825059 2.5722e-05
## age        < 2e-16 0.0047031 3.8597e-07
## dist_metro 0.32464 0.1338146 0.18430   
## foreigners 0.21741 0.1220107 0.39021   
## prgreensp  0.63436 0.8744280 0.49115   
## popdens    0.25773 0.3628930 0.78808   
## museums    0.54422 0.3203801 0.34414   
## airbnb     0.44960 0.1821018 0.20642
Hausman.test(SDEM_pr_md)
## 
##  Spatial Hausman test (asymptotic)
## 
## data:  NULL
## Hausman test = 229.24, df = 17, p-value < 2.2e-16

Reaching out the SLX models does not help, because although - as with the SDEM models - the indirect impacts (coefficients on lagged \(X\) variables) are large, so including lagged \(X\) variables especially at the properties level seems sensible, there is serious residual autocorrelation, and now the pre-test strategy points to a missing spatial process in the response:

SLX_pr <- lmSLX(f_pr, data=properties_in_dd, listw=lw, Durbin=TRUE)
summary(impacts(SLX_pr), short=TRUE, zstats=TRUE)
## Impact measures (SlX, estimable, n-k):
##                 Direct   Indirect     Total
## size         3.9424014  6.4922295 10.434631
## age        -22.6778266 13.6377582 -9.040068
## dist_metro   0.5710406 -0.8727816 -0.301741
## ========================================================
## Standard errors:
##               Direct  Indirect      Total
## size       0.4600099 0.8896271 0.90358915
## age        1.3642232 2.1630220 2.15389898
## dist_metro 0.3650804 0.3766962 0.07031201
## ========================================================
## Z-values:
##                Direct  Indirect     Total
## size         8.570254  7.297698 11.547982
## age        -16.623252  6.304956 -4.197072
## dist_metro   1.564150 -2.316938 -4.291458
## 
## p-values:
##            Direct  Indirect   Total     
## size       < 2e-16 2.9265e-13 < 2.22e-16
## age        < 2e-16 2.8828e-10 2.7039e-05
## dist_metro 0.11778 0.020507   1.7750e-05
lm.morantest(SLX_pr, lw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
## 
## Moran I statistic standard deviate = 24.275, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.4505572652    -0.0029076149     0.0003489564
lm.LMtests(SLX_pr, lw, test=c("RLMerr", "RLMlag"))
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
## 
## RLMerr = 3.9929, df = 1, p-value = 0.04569
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
## 
## RLMlag = 54.185, df = 1, p-value = 1.825e-13
SLX_pr_md <- lmSLX(f_pr_md, data=properties_in_dd, listw=lw, Durbin=TRUE)
summary(impacts(SLX_pr_md), short=TRUE, zstats=TRUE)
## Impact measures (SlX, estimable, n-k):
##                  Direct     Indirect       Total
## size          3.6899279    5.1853598   8.8752878
## age         -22.4862582    7.7433691 -14.7428891
## dist_metro    0.4883417   -0.6216376  -0.1332958
## foreigners  111.8693211 -144.2912677 -32.4219466
## prgreensp    -7.9374540   33.7296361  25.7921821
## popdens    -175.9679694   76.3367323 -99.6312371
## museums     132.9274926 -150.5428303 -17.6153377
## airbnb       -1.6203107    2.0992718   0.4789610
## ========================================================
## Standard errors:
##                 Direct    Indirect        Total
## size         0.4374870   0.8640075   0.88540694
## age          1.2999436   2.1706052   2.20258646
## dist_metro   0.3469593   0.3606514   0.07353124
## foreigners 113.3256387 115.6423334  22.69480121
## prgreensp   98.1127344  99.5966256  16.36663957
## popdens    521.4661878 532.2960996 105.63203070
## museums    115.4990904 118.1243717  21.60621836
## airbnb       1.5181784   1.5638933   0.33577968
## ========================================================
## Z-values:
##                  Direct   Indirect      Total
## size         8.43437158  6.0015219 10.0239645
## age        -17.29787276  3.5673779 -6.6934440
## dist_metro   1.40748996 -1.7236520 -1.8127784
## foreigners   0.98714927 -1.2477374 -1.4286068
## prgreensp   -0.08090136  0.3386624  1.5758997
## popdens     -0.33744847  0.1434103 -0.9431915
## museums      1.15089645 -1.2744434 -0.8152902
## airbnb      -1.06727292  1.3423369  1.4264145
## 
## p-values:
##            Direct  Indirect   Total     
## size       < 2e-16 1.9548e-09 < 2.22e-16
## age        < 2e-16 0.00036057 2.1798e-11
## dist_metro 0.15928 0.08477068 0.069866  
## foreigners 0.32357 0.21212723 0.153117  
## prgreensp  0.93552 0.73486404 0.115049  
## popdens    0.73578 0.88596617 0.345583  
## museums    0.24977 0.20250631 0.414906  
## airbnb     0.28585 0.17948677 0.153749
lm.morantest(SLX_pr_md, lw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
## 
## Moran I statistic standard deviate = 22.047, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.3978202243    -0.0070160067     0.0003371825
lm.LMtests(SLX_pr_md, lw, test=c("RLMerr", "RLMlag"))
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
## 
## RLMerr = 15.011, df = 1, p-value = 0.0001069
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
## 
## RLMlag = 56.758, df = 1, p-value = 4.929e-14

So on balance, the pre-test strategy has not worked out too well; it is unclear what is missing in the model.

ex 17.4

Turning to estimating the general nested model first, followed by excluding the Durbin (spatially lagged \(X\)) variables, a likelihood ratio test shows that the spatially lagged \(X\) variables should be retained in the model:

GNM_pr <- sacsarlm(f_pr, data=properties_in_dd, listw=lw, Durbin=TRUE,
    control=list(pre_eig1=eigs, pre_eig2=eigs))
SARAR_pr <- sacsarlm(f_pr, data=properties_in_dd, listw=lw, 
    control=list(pre_eig1=eigs, pre_eig2=eigs))
lmtest::lrtest(SARAR_pr, GNM_pr)
## Likelihood ratio test
## 
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   7 -7814.6                         
## 2  10 -7783.9  3 61.303  3.096e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again using a likelihood ratio test, the GNM model outperforms the SDEM model:

lmtest::lrtest(SDEM_pr, GNM_pr)
## Likelihood ratio test
## 
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro
##   #Df  LogLik Df Chisq Pr(>Chisq)    
## 1   9 -7850.3                        
## 2  10 -7783.9  1 132.7  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SDM_pr <- lagsarlm(f_pr, data=properties_in_dd, listw=lw, Durbin=TRUE,
    control=list(pre_eig=eigs))

as is also the case with the SDM model:

lmtest::lrtest(SDM_pr, GNM_pr)
## Likelihood ratio test
## 
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   9 -7842.4                         
## 2  10 -7783.9  1 117.06  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and the SLX model:

lmtest::lrtest(SLX_pr, GNM_pr)
## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of
## class "SlX", updated model is of class "Sarlm"
## Likelihood ratio test
## 
## Model 1: y ~ size + age + dist_metro + lag.size + lag.age + lag.dist_metro
## Model 2: prpsqm ~ size + age + dist_metro
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -8040.1                         
## 2  10 -7783.9  2 512.46  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is the inclusion of the municipality department level variables in the GNM model justified?

GNM_pr_md <- sacsarlm(f_pr_md, data=properties_in_dd, listw=lw, Durbin=TRUE,
    control=list(pre_eig1=eigs, pre_eig2=eigs))

No, not really:

lmtest::lrtest(GNM_pr, GNM_pr_md)
## Likelihood ratio test
## 
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens + 
##     museums + airbnb
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1  10 -7783.9                       
## 2  20 -7773.4 10 20.956     0.0214 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we drop the municipality department level variables from the Durbin term, we lose fewer degrees of freedom, so preferring the model including the municipality department level variables:

GNM_pr_md1 <- sacsarlm(f_pr_md, data=properties_in_dd, listw=lw, 
    Durbin= ~ size + age + dist_metro,
    control=list(pre_eig1=eigs, pre_eig2=eigs))
lmtest::lrtest(GNM_pr, GNM_pr_md1)
## Likelihood ratio test
## 
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens + 
##     museums + airbnb
##   #Df  LogLik Df  Chisq Pr(>Chisq)   
## 1  10 -7783.9                        
## 2  15 -7775.5  5 16.794   0.004908 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unfortunately, impacts are depressing here:

trs <- trW(as(lw, "CsparseMatrix"))
i_GNM_pr_md1 <- impacts(GNM_pr_md1, tr=trs, R=2000)
summary(i_GNM_pr_md1, short=TRUE, zstats=TRUE)
## Impact measures (sacmixed, trace):
##                  Direct    Indirect        Total
## size         3.23740995   7.2985861   10.5359961
## age        -23.03565197  10.9935858  -12.0420662
## dist_metro   0.26484655  -0.4603147   -0.1954681
## foreigners  -5.71030959 -23.1783890  -28.8886986
## prgreensp    5.63045876  22.8542711   28.4847299
## popdens    -20.87711878 -84.7411114 -105.6182302
## museums     -4.03610471 -16.3827204  -20.4188251
## airbnb       0.09458806   0.3839369    0.4785250
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##                Direct    Indirect       Total
## size        0.3675564   2.7172986   2.8701576
## age         1.0662779   6.2386275   6.6524247
## dist_metro  0.2172000   0.3157051   0.2011017
## foreigners 11.5239909  47.9039965  59.3689907
## prgreensp   8.4447492  34.9557814  43.3583761
## popdens    53.4568884 221.8429177 275.0663816
## museums    11.2137949  46.3673128  57.5286869
## airbnb      0.1731131   0.7184815   0.8906739
## 
## Simulated z-values:
##                 Direct   Indirect      Total
## size         8.8005035  2.6735477  3.6581644
## age        -21.5633924  1.7943591 -1.7735235
## dist_metro   1.2274254 -1.4816625 -1.0003479
## foreigners  -0.4812273 -0.4726568 -0.4747901
## prgreensp    0.6779955  0.6605050  0.6645537
## popdens     -0.3963452 -0.3823738 -0.3854135
## museums     -0.3397487 -0.3362316 -0.3372235
## airbnb       0.5265902  0.5183114  0.5204562
## 
## Simulated p-values:
##            Direct  Indirect  Total     
## size       < 2e-16 0.0075054 0.00025403
## age        < 2e-16 0.0727559 0.07614201
## dist_metro 0.21966 0.1384301 0.31714218
## foreigners 0.63035 0.6364580 0.63493658
## prgreensp  0.49777 0.5089298 0.50633600
## popdens    0.69185 0.7021841 0.69993106
## museums    0.73405 0.7366962 0.73594840
## airbnb     0.59848 0.6042411 0.60274564

The values and standard errors of the spatial coefficients suggest numerical problems in finding an optimum where the two coefficients are equally strong but with opposing signs:

c("response"=GNM_pr_md1$rho, "response se"=GNM_pr_md1$rho.se, "residual"=GNM_pr_md1$lambda, "residual se"=GNM_pr_md1$lambda.se)
##    response.rho     response se residual.lambda     residual se 
##      0.85971901      0.01547632     -0.86679315      0.04377824

If we fall back on the properties level only GNM, total impacts are only significant in conventional terms for size:

i_GNM_pr <- impacts(GNM_pr, tr=trs, R=2000)
summary(i_GNM_pr, short=TRUE, zstats=TRUE)
## Impact measures (sacmixed, trace):
##                 Direct   Indirect     Total
## size         3.3625439  9.2475200 12.610064
## age        -22.7517496 16.8353068 -5.916443
## dist_metro   0.2927224 -0.6523253 -0.359603
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##               Direct  Indirect     Total
## size       0.3745881 2.8837104 3.0388158
## age        1.0767407 6.5778940 7.0048158
## dist_metro 0.2211039 0.3082192 0.1990367
## 
## Simulated z-values:
##                Direct  Indirect      Total
## size         8.988053  3.189002  4.1341687
## age        -21.094711  2.605198 -0.7961408
## dist_metro   1.321073 -2.121656 -1.8179601
## 
## Simulated p-values:
##            Direct  Indirect  Total     
## size       < 2e-16 0.0014276 3.5624e-05
## age        < 2e-16 0.0091821 0.42595   
## dist_metro 0.18648 0.0338666 0.06907

The same problem occurs without the municipality department level variables; the impacts are being driven by the large spatial coefficient on the lagged response:

c("response"=GNM_pr$rho, "response se"=GNM_pr$rho.se, "residual"=GNM_pr$lambda, "residual se"=GNM_pr$lambda.se)
##    response.rho     response se residual.lambda     residual se 
##      0.88013826      0.01319649     -0.89698374      0.03914228

ex 17.5

We cannot say that the spatial econometrics approach has reached a clear conclusion. When including the upper level variables, we introduce a lot of spatial autocorrelation at the lower level. It is arguable that the MRF random effect at the upper level and including only the properties level variables gets at least as far as the most complex spatial econometrics models. It is fairly clear that mapping the actual green space and museums, and measuring distance from each property to the attractions would remove the scale problem for those variables. Disaggregation of the foreigners, airbnb and population density variables would be highly desirable. With improvements to the properties level data set, including more variables describing the properties themselves, much of the mis-specification should be removed.